Intuition about Machine Learning Optimizers

Jino Rohit
5 min read · Aug 28, 2021


Today we will be talking about the various optimizers used in machine learning and why each of them works.

Optimizers in machine learning are what help your model learn. They find the values of the parameters that bring your loss function to its lowest point, but they have no clue about the path they take, more like they are blindfolded. They might sometimes get stuck in flat terrain and not move any further. This is why it is important to know how and why each optimizer works, and this article should help you find the right one for your project.

Gradient Descent

The original optimizer. Gradient descent works by taking small steps toward the optimal theta, making use of the gradients of the cost function. But but.. updates to theta are very rare: an update is made only after seeing the entire dataset, and sometimes (nope, most of the time) you will overshoot and cross your local minima. Also, if the gradients are very small, learning will be extremely slow and you might not even reach the minima. Pain + Pain = Pain 😑
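
To make this concrete, here is a minimal NumPy sketch of batch gradient descent on a toy least-squares problem. The data, learning rate and iteration count are all made up for illustration; the point is that each update uses the gradient over the entire dataset.

```python
import numpy as np

# Toy least-squares problem: data and constants are illustrative only.
X = np.random.randn(100, 3)           # 100 examples, 3 features
y = X @ np.array([2.0, -1.0, 0.5])    # targets generated from a known theta
theta = np.zeros(3)
lr = 0.1

for _ in range(200):
    # One update per pass over the ENTIRE dataset
    grad = 2 * X.T @ (X @ theta - y) / len(X)
    theta -= lr * grad

print(theta)  # should approach [2.0, -1.0, 0.5]
```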

Stochastic Gradient Descent

This can be considered an enhancement of the previous optimizer: here updates are very frequent, with one update made after seeing every single data point in the dataset. Something is weird about this though, hmm, it seems like it won't be very stable? If we have a lot of different types of images in the dataset, it will make a lot of noisy jumps and cross the minima again!

See what I’m talking about, the really weird left and right oscillations.
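
Here is roughly what that looks like in code, a sketch on the same kind of toy data with an illustrative learning rate: one noisy update per data point.

```python
import numpy as np

# Stochastic gradient descent sketch: toy data, illustrative constants.
X = np.random.randn(100, 3)
y = X @ np.array([2.0, -1.0, 0.5])
theta = np.zeros(3)
lr = 0.01

for _ in range(20):                        # epochs
    for i in np.random.permutation(len(X)):
        xi, yi = X[i], y[i]
        grad = 2 * xi * (xi @ theta - yi)  # gradient from a single data point
        theta -= lr * grad                 # frequent, noisy updates
```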

Mini Batch Gradient Descent

In mini-batch GD, updates are made on small batches that are sent for training, and this makes the training process faster and easier to converge. But we have to make sure that each batch contains a mix of different data points, else the same features will be learned over and over and a lot of noise will be created again, i.e. the variance in the weight updates will be very uneven.

Wayyy less noise compared to the previous optimizer, much better 😏
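
A rough sketch of the same toy problem with mini-batches. The batch size of 16 and the shuffling are illustrative choices; the shuffling is what keeps each batch a mix of different data points.

```python
import numpy as np

# Mini-batch gradient descent sketch: toy data, illustrative constants.
X = np.random.randn(100, 3)
y = X @ np.array([2.0, -1.0, 0.5])
theta = np.zeros(3)
lr, batch_size = 0.05, 16

for _ in range(50):                                # epochs
    idx = np.random.permutation(len(X))            # shuffle so batches mix data points
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ theta - yb) / len(Xb)
        theta -= lr * grad
```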

Momentum

Now sometimes the model will change more in one direction than in the others, mainly when the training examples follow a pattern. Introducing a momentum term helps the model learn faster by paying more attention to the direction that keeps repeating across similar images, while making only small updates for the odd, different ones. But if you think about it, too much momentum should blast you out of the local minima and into outer space 😬
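
Here is a minimal sketch of the classic momentum update. grad_fn, lr and beta are illustrative placeholders; beta controls how much of the past velocity is kept, and cranking it too high is exactly how you get blasted past the minima.

```python
import numpy as np

# Momentum update sketch: all names and constants are illustrative.
def momentum_step(theta, velocity, grad_fn, lr=0.01, beta=0.9):
    grad = grad_fn(theta)
    velocity = beta * velocity + grad   # accumulate the repeating direction
    theta = theta - lr * velocity       # too large a beta can overshoot the minimum
    return theta, velocity

theta, velocity = np.zeros(3), np.zeros(3)
grad_fn = lambda t: 2 * t               # toy gradient of f(theta) = ||theta||^2
for _ in range(100):
    theta, velocity = momentum_step(theta, velocity, grad_fn)
```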

Momentum + Acceleration

In this case the same thing happens, but with a small change: the model gains momentum with every similar training image it sees and the updates grow larger, but as soon as it encounters a different image, the momentum term pays less attention to it and the update to the weights decelerates, which stops it from overshooting.
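
This look-ahead behaviour is commonly known as Nesterov accelerated gradient. Below is a minimal sketch of one common formulation, where the gradient is evaluated at a look-ahead point instead of the current theta; all the constants are illustrative.

```python
import numpy as np

# Nesterov-style momentum sketch: peek ahead before committing to the step.
def nesterov_step(theta, velocity, grad_fn, lr=0.01, beta=0.9):
    lookahead = theta - lr * beta * velocity
    grad = grad_fn(lookahead)            # gradient at the look-ahead point
    velocity = beta * velocity + grad
    theta = theta - lr * velocity        # the look-ahead slows it down near the minima
    return theta, velocity

theta, velocity = np.zeros(3), np.zeros(3)
grad_fn = lambda t: 2 * t                # toy quadratic bowl
for _ in range(100):
    theta, velocity = nesterov_step(theta, velocity, grad_fn)
```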

Adagrad

This is more of an adaptive algorithm, where you effectively get different learning rates for different parameters: small updates for parameters tied to frequently occurring features and large ones for parameters tied to features that occur rarely. Thanks to those large updates for the rarely occurring features, the model still works well even when the data is sparse. It does this by making use of the squared values of the past gradients. The problem here is that the G term, the sum of squares of the past gradients, is monotonically increasing, and at some point the learning rate becomes so small that learning does not happen anymore.
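
A minimal sketch of the Adagrad step; the toy gradient and constants are illustrative. Notice that G only ever grows, which is exactly the problem described above.

```python
import numpy as np

# Adagrad sketch: per-parameter learning rates scaled by accumulated squared gradients.
def adagrad_step(theta, G, grad_fn, lr=0.1, eps=1e-8):
    grad = grad_fn(theta)
    G = G + grad ** 2                              # monotonically increasing accumulator
    theta = theta - lr * grad / (np.sqrt(G) + eps)
    return theta, G

theta, G = np.zeros(3), np.zeros(3)
grad_fn = lambda t: 2 * t
for _ in range(100):
    theta, G = adagrad_step(theta, G, grad_fn)
# As G grows, lr / sqrt(G) shrinks toward zero and learning eventually stalls.
```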

Adadelta

This solves the problem of Adagrad by adding a gamma term that decays the contribution of the older squared gradient values. Now the denominator does not explode and the learning continues.
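
Here is a sketch of that gamma-decayed accumulator. Note that full Adadelta additionally keeps a running average of squared parameter updates to replace the learning rate, which is omitted here, so this sketch is closer to an RMSprop-style step; all constants are illustrative.

```python
import numpy as np

# Decaying average of squared gradients: gamma discounts old values so the
# denominator stops growing forever (the part Adagrad was missing).
def adadelta_like_step(theta, Eg2, grad_fn, lr=0.01, gamma=0.9, eps=1e-8):
    grad = grad_fn(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2      # leaky accumulator, never explodes
    theta = theta - lr * grad / (np.sqrt(Eg2) + eps)
    return theta, Eg2

theta, Eg2 = np.zeros(3), np.zeros(3)
grad_fn = lambda t: 2 * t
for _ in range(100):
    theta, Eg2 = adadelta_like_step(theta, Eg2, grad_fn)
```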

Adam

This is more like Adadelta + momentum, where every parameter gets its own momentum-style update as well. This is done by keeping a decaying average of the past gradients, which works much like momentum: you start slowly and pick up speed as you go. Due to its extremely fast convergence and accuracy, Adam is widely used as a default for many machine learning problems.
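
A minimal sketch of one Adam step. The hyperparameters shown are the commonly quoted defaults, used here purely for illustration: m is the momentum-like decaying average of gradients, v is the Adadelta-style decaying average of squared gradients, and both get a bias correction for the early steps.

```python
import numpy as np

# Adam sketch: first moment m (momentum-like) + second moment v (per-parameter scale).
def adam_step(theta, m, v, t, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    grad = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * grad           # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
grad_fn = lambda t_: 2 * t_
for t in range(1, 101):
    theta, m, v = adam_step(theta, m, v, t, grad_fn)
```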

Now the list goes on, and despite the huge number of optimizers available out there, they are all more or less built on top of each other, each with an additional term to perform a little better than the last.

I hope you find the right one 🥂


Written by Jino Rohit

Neurons that fire together, wire together
