Written by Sara Bozyel
Many people may be using optimizers while training a neural network without knowing that the technique is called optimization. Optimizers are algorithms or methods used to change the attributes of your neural network, such as weights and learning rate, to reduce the losses (2). How you should change the weights or learning rate of your neural network to reduce the losses is defined by the optimizer you use (2). Optimization algorithms and strategies are responsible for reducing losses and producing the most accurate results possible. But what are the different types of optimizers?
Gradient Descent
Gradient Descent is the most basic but most used optimization algorithm. It is extensively used in linear regression and classification algorithms (2). Backpropagation in neural networks also uses a gradient descent algorithm (2).
Gradient descent is a first-order optimization algorithm that depends on the first-order derivative of the loss function (2). It calculates the direction in which the weights must be changed for the function to reach a minimum. With backpropagation, the loss is propagated from one layer to the next, and the model parameters, also known as weights, are adjusted according to the loss so that it can be minimized (2).
Algorithm: θ = θ − α⋅∇J(θ)
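As an illustration, here is a minimal NumPy sketch of this update rule; the gradient function grad_J, the learning rate alpha, the step count, and the quadratic example loss are all hypothetical choices, not taken from the references.

import numpy as np

def gradient_descent(theta, grad_J, alpha=0.01, n_steps=100):
    # Vanilla gradient descent: theta = theta - alpha * dJ/dtheta,
    # using the gradient over the full training set at every step.
    for _ in range(n_steps):
        theta = theta - alpha * grad_J(theta)
    return theta

# Example: minimize J(theta) = ||theta||^2, whose gradient is 2*theta.
theta = gradient_descent(np.array([3.0, -2.0]), lambda th: 2 * th)
print(theta)  # approaches [0, 0]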
Stochastic Gradient Descent
It is a variant of Gradient Descent that tries to update the model parameters more frequently. The parameters are updated after the loss is computed on each training example. So if the dataset contains 1,000 rows, SGD updates the model parameters 1,000 times in one pass over the dataset, instead of once as in Gradient Descent (2).
θ = θ − α⋅∇J(θ; x(i); y(i)), where {x(i), y(i)} are the training examples.
Because the model parameters are updated so frequently, they have high variance and the loss function fluctuates with varying intensity (2).
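A minimal sketch of this per-example update, assuming the dataset is given as arrays X and y and that a per-example gradient function grad_J(theta, x_i, y_i) is available (all names here are hypothetical):

import numpy as np

def sgd(theta, X, y, grad_J, alpha=0.01, epochs=1):
    # Stochastic gradient descent: one parameter update per training example.
    n = len(X)
    for _ in range(epochs):
        for i in np.random.permutation(n):  # visit examples in random order
            theta = theta - alpha * grad_J(theta, X[i], y[i])
    return theta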
Mini-Batch Gradient Descent
It is the best among all the variations of gradient descent and an improvement over both SGD and standard gradient descent (1). The dataset is divided into several batches, and the model parameters are updated after each batch (1).
θ=θ−α⋅∇J(θ; B(i)), where {B(i)} are the batches of training examples.
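A sketch of the batched variant under the same hypothetical setup, where grad_J now takes a batch of examples and batch_size is a tunable choice:

import numpy as np

def minibatch_gd(theta, X, y, grad_J, alpha=0.01, batch_size=32, epochs=1):
    # Mini-batch gradient descent: one parameter update per batch B(i).
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            theta = theta - alpha * grad_J(theta, X[batch], y[batch])
    return theta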
Momentum
Momentum was invented to reduce the high variance of SGD and smooth the convergence. It accelerates convergence in the relevant direction and reduces fluctuation in the irrelevant direction (1). This method uses another hyperparameter, known as momentum and symbolized by ‘γ’ (1).
V(t) = γ⋅V(t−1) + α⋅∇J(θ)
Now, the weights are updated by θ=θ−V(t).
The momentum term γ is usually set to 0.9 or a similar value (1).
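A sketch of a single momentum step, assuming the current gradient has already been computed; the function name and defaults are illustrative:

def momentum_step(theta, v, grad, alpha=0.01, gamma=0.9):
    # V(t) = gamma * V(t-1) + alpha * grad; theta = theta - V(t)
    v = gamma * v + alpha * grad   # accumulate the velocity
    theta = theta - v              # move along the accumulated direction
    return theta, v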
Nesterov Accelerated Gradient
Momentum may be a good method, but if the momentum is too high, the algorithm may overshoot the local minima and keep moving past them (1). The NAG algorithm was developed to solve this problem (1). It is a look-ahead method: we know we will use γV(t−1) to change the weights, so θ − γV(t−1) tells us approximately the future position (1). The gradient is then computed at this look-ahead point rather than at the current parameters.
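A sketch of one Nesterov step under the same assumptions, where grad_J is a hypothetical gradient function evaluated at the look-ahead point:

def nag_step(theta, v, grad_J, alpha=0.01, gamma=0.9):
    # Evaluate the gradient at the approximate future position theta - gamma * V(t-1).
    lookahead = theta - gamma * v
    v = gamma * v + alpha * grad_J(lookahead)
    theta = theta - v
    return theta, v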
Adagrad
One drawback of all the optimizers described so far is that the learning rate is constant for all parameters and for every cycle (1). Adagrad modifies the learning rate: it changes the learning rate 'η' for each parameter and at each time step 't' (1). It still works on the first-order derivative of the loss function, but accumulates the squares of past gradients to scale each parameter's updates (1).
The gradient of the loss function for a given parameter θ(i) at time step t:
g(t,i) = ∇J(θ(t,i))
The update for parameter θ(i) at time step t:
θ(t+1,i) = θ(t,i) − (η / √(G(t,ii) + ϵ)) ⋅ g(t,i)
Here η is a learning rate modified for each parameter θ(i) at each time step, based on the previous gradients calculated for that parameter. G(t) accumulates the sum of the squares of the gradients w.r.t. θ(i) up to time step t, and ϵ is a smoothing term (usually on the order of 1e−8) that prevents division by zero. Interestingly, the algorithm performs much worse without the square root operation (1).
It makes large updates for infrequently updated parameters and small steps for frequently updated ones (1).
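A per-step sketch of the Adagrad update, where G holds the accumulated squared gradients for each parameter (names and defaults are illustrative):

import numpy as np

def adagrad_step(theta, G, grad, eta=0.01, eps=1e-8):
    # Accumulate squared gradients, then scale each parameter's step by 1/sqrt(G + eps),
    # so rarely-updated parameters take larger steps than frequently-updated ones.
    G = G + grad ** 2
    theta = theta - (eta / np.sqrt(G + eps)) * grad
    return theta, G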
AdaDelta
It is an extension of AdaGrad that aims to eliminate its decaying learning rate problem (2). Instead of summing all previously squared gradients, AdaDelta limits the window of accumulated past gradients to some fixed size w, and an exponentially decaying moving average is used instead of the sum of all past squared gradients (2).
E[g²](t) = γ⋅E[g²](t−1) + (1−γ)⋅g²(t)
We set γ to a similar value as the momentum term, around 0.9.
The parameters are then updated using this running average in place of the accumulated sum:
θ(t+1) = θ(t) − (η / √(E[g²](t) + ϵ)) ⋅ g(t)
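A sketch of the simplified update described above, which replaces the accumulated sum with the exponential moving average E[g²]; note that the full AdaDelta algorithm additionally tracks a running average of squared parameter updates, which is omitted here:

import numpy as np

def adadelta_like_step(theta, Eg2, grad, eta=0.001, gamma=0.9, eps=1e-8):
    # Exponentially decaying average of squared gradients instead of a growing sum.
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    theta = theta - (eta / np.sqrt(Eg2 + eps)) * grad
    return theta, Eg2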
Adam
Adam (Adaptive Moment Estimation) works with first- and second-order moments of the gradients (2). The intuition behind Adam is that we don't want to roll so fast that we jump over the minimum; we want to slow down a bit for a more careful search. In addition to storing an exponentially decaying average of past squared gradients V(t) like AdaDelta, Adam also keeps an exponentially decaying average of past gradients M(t) (2).
M(t) and V(t) are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively.
First and second moment updates:
M(t) = β1⋅M(t−1) + (1−β1)⋅g(t)
V(t) = β2⋅V(t−1) + (1−β2)⋅g²(t)
Because M(t) and V(t) are initialized at zero, they are biased toward zero, so bias-corrected estimates are computed so that E[M̂(t)] matches E[g(t)], where E[f(x)] is the expected value of f(x):
M̂(t) = M(t) / (1 − β1^t)
V̂(t) = V(t) / (1 − β2^t)
To update the parameters:
θ(t+1) = θ(t) − (η / (√V̂(t) + ϵ)) ⋅ M̂(t)
The typical values are 0.9 for β1, 0.999 for β2, and 10⁻⁸ for ‘ϵ’.
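A per-step sketch of the Adam update with bias correction; t is the 1-based iteration counter, and the default hyperparameters mirror the typical values above:

import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment (mean) and second moment (uncentered variance) of the gradients.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction (t starts at 1 so the denominators are non-zero).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v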
Adam is generally the best of these optimizers. If you want to train a neural network in less time and more efficiently, Adam is the optimizer to choose. For sparse data, use one of the optimizers with a dynamic learning rate. If you prefer to stick with a plain gradient descent algorithm, mini-batch gradient descent is the best option.
References:
1. Chen, J., & Liu, Y. (n.d.). Neural optimization machine: A neural network approach for optimization. arXiv. https://arxiv.org/abs/2208.03897
2. Doshi, S. (2020, August 3). Various optimization algorithms for training neural network. Medium. https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6