How to set adaptive learning rate for GradientDescentOptimizer

Question

I am using TensorFlow to train a neural network  This is how I am initializing the GradientDescentOptimizer   init   tf initialize all variables   sess   tf Session   sess run init   mse          tf reduce mean tf square out - out    train step   tf train GradientDescentOptimizer 0 3  minimize mse    The thing here is that I don t know how to set an update rule for the learning rate or a decay value for that    How can I use an adaptive learning rate here

User · Answer

Gradient descent algorithm uses the constant learning rate which you can provide in during the initialization. You can pass various learning rates in a way showed by Mrry.

But instead of it you can also use more advanced optimizers which have faster convergence rate and adapts to the situation.

Here is a brief explanation based on my understanding:

momentum helps SGD to navigate along the relevant directions and softens the oscillations in the irrelevant. It simply adds a fraction of the direction of the previous step to a current step. This achieves amplification of speed in the correct dirrection and softens oscillation in wrong directions. This fraction is usually in the (0, 1) range. It also makes sense to use adaptive momentum. In the beginning of learning a big momentum will only hinder your progress, so it makse sense to use something like 0.01 and once all the high gradients disappeared you can use a bigger momentom. There is one problem with momentum: when we are very close to the goal, our momentum in most of the cases is very high and it does not know that it should slow down. This can cause it to miss or oscillate around the minima
nesterov accelerated gradient overcomes this problem by starting to slow down early. In momentum we first compute gradient and then make a jump in that direction amplified by whatever momentum we had previously. NAG does the same thing but in another order: at first we make a big jump based on our stored information, and then we calculate the gradient and make a small correction. This seemingly irrelevant change gives significant practical speedups.
AdaGrad or adaptive gradient allows the learning rate to adapt based on parameters. It performs larger updates for infrequent parameters and smaller updates for frequent one. Because of this it is well suited for sparse data (NLP or image recognition). Another advantage is that it basically illiminates the need to tune the learning rate. Each parameter has its own learning rate and due to the peculiarities of the algorithm the learning rate is monotonically decreasing. This causes the biggest problem: at some point of time the learning rate is so small that the system stops learning
AdaDelta resolves the problem of monotonically decreasing learning rate in AdaGrad. In AdaGrad the learning rate was calculated approximately as one divided by the sum of square roots. At each stage you add another square root to the sum, which causes denominator to constantly decrease. In AdaDelta instead of summing all past square roots it uses sliding window which allows the sum to decrease. RMSprop is very similar to AdaDelta
Adam or adaptive momentum is an algorithm similar to AdaDelta. But in addition to storing learning rates for each of the parameters it also stores momentum changes for each of them separately

A few visualizations:

User · Answer

Tensorflow provides an op to automatically apply an exponential decay to a learning rate tensor  tf train exponential decay   For an example of it in use  see this line in the MNIST convolutional model example   Then use  mrry s suggestion above to supply this variable as the learning rate parameter to your optimizer of choice   The key excerpt to look at is     Optimizer  set up a variable that s incremented once per batch and   controls the learning rate decay  batch   tf Variable 0   learning rate   tf train exponential decay    0 01                   Base learning rate    batch   BATCH SIZE     Current index into the dataset    train size             Decay step    0 95                   Decay rate    staircase True    Use simple momentum for the optimization  optimizer   tf train MomentumOptimizer learning rate                                       0 9  minimize loss                                                     global step batch    Note the global step batch parameter to minimize   That tells the optimizer to helpfully increment the  batch  parameter for you every time it trains

User · Answer

If you want to set specific learning rates for intervals of epochs like  0  lt  a  lt  b  lt  c  lt       Then you can define your learning rate as a conditional tensor  conditional on the global step  and feed this as normal to the optimiser    You could achieve this with a bunch of nested tf cond statements  but its easier to build the tensor recursively   def make learning rate tensor reduction steps  learning rates  global step       assert len reduction steps    1    len learning rates      if len reduction steps     1          return tf cond              global step  lt  reduction steps 0               lambda  learning rates 0               lambda  learning rates 1                else          return tf cond              global step  lt  reduction steps 0               lambda  learning rates 0               lambda  make learning rate tensor                  reduction steps 1                    learning rates 1                    global step                   Then to use it you need to know how many training steps there are in a single epoch  so that we can use the global step to switch at the right time  and finally define the epochs and learning rates you want  So if I want the learning rates  0 1  0 01  0 001  0 0001  during the epoch intervals of  0  19    20  59    60  99    100   infty  respectively  I would do   global step   tf train get or create global step   learning rates    0 1  0 01  0 001  0 0001  steps per epoch   225 epochs to switch at    20  60  100  epochs to switch at    x steps per epoch for x in epochs to switch at   learning rate   make learning rate tensor epochs to switch at   learning rates  global step

User · Answer

From tensorflow official docs  global step   tf Variable 0  trainable False  starter learning rate   0 1 learning rate   tf train exponential decay starter learning rate  global step                                         100000  0 96  staircase True     Passing global step to minimize   will increment it at each step  learning step     tf train GradientDescentOptimizer learning rate   minimize    my loss     global step global step

User · Answer

First of all  tf train GradientDescentOptimizer is designed to use a constant learning rate for all variables in all steps  TensorFlow also provides out-of-the-box adaptive optimizers including the tf train AdagradOptimizer and the tf train AdamOptimizer  and these can be used as drop-in replacements   However  if you want to control the learning rate with otherwise-vanilla gradient descent  you can take advantage of the fact that the learning rate argument to the tf train GradientDescentOptimizer constructor can be a Tensor object  This allows you to compute a different value for the learning rate in each step  for example   learning rate   tf placeholder tf float32  shape           train step   tf train GradientDescentOptimizer      learning rate learning rate  minimize mse   sess   tf Session      Feed different values for learning rate to each training step  sess run train step  feed dict  learning rate  0 1   sess run train step  feed dict  learning rate  0 1   sess run train step  feed dict  learning rate  0 01   sess run train step  feed dict  learning rate  0 01     Alternatively  you could create a scalar tf Variable that holds the learning rate  and assign it each time you want to change the learning rate

[python] How to set adaptive learning rate for GradientDescentOptimizer?

Examples related to python

Examples related to tensorflow