Breaking Down SGD, Adam, and RMSProp: An Intuitive Explanation

No math, just easy-to-understand English analogies and a coding example

Are you ready to level up your optimization game?

Let's talk about optimizers – the steering wheel for your model's learning process. Choosing the right one can mean the difference between aimlessly wandering and smoothly sailing toward your prediction goals.

In this post, I'll give you an intuitive explanation of 3 popular optimizers: SGD, Adam, and RMSProp.

Let's start with SGD...

Stochastic Gradient Descent (SGD)

SGD is a widely-used optimization algorithm in machine learning.

It was first introduced in a 1951 paper by Herbert Robbins and Sutton Monro called "A Stochastic Approximation Method." The algorithm updates the model weights using the gradient of the loss with respect to the weights, computed on a single training example (or a small batch). Because it uses random samples of the data rather than the full dataset to estimate the direction of descent, each update is cheap to compute, at the cost of some noise in the steps.

But how does it work?

Here's an intuitive explanation of SGD, step-by-step:

You start by setting your model's parameters (weights and biases) to random or predetermined values. Then, for each training example, you'll do the following (a short code sketch follows the list):

  • Calculate your model's prediction error (loss) on the example.

  • Compute the gradient of the loss with respect to the model parameters.

  • Update your model parameters in the opposite direction of the gradient, using a step size (learning rate) to control the update size.
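
Here's what those steps look like as a minimal sketch in plain PyTorch. The toy data, model, and learning rate below are just placeholders for illustration:

```python
import torch

torch.manual_seed(0)

# Toy data: 100 examples with 3 features and a regression target.
X = torch.randn(100, 3)
y = torch.randn(100, 1)

# Initialize the model parameters (weights and bias).
w = torch.zeros(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

lr = 0.01  # learning rate (step size)

for xi, yi in zip(X, y):              # one training example at a time
    pred = xi @ w + b                 # model's prediction
    loss = (pred - yi).pow(2).mean()  # prediction error (squared loss)

    loss.backward()                   # gradient of the loss w.r.t. w and b

    with torch.no_grad():
        w -= lr * w.grad              # step in the opposite direction of the gradient
        b -= lr * b.grad
        w.grad.zero_()                # reset gradients for the next example
        b.grad.zero_()
```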

Imagine you're a chef trying to find the perfect recipe for cookies.


You start with a basic recipe and make small batches of cookies to taste-test them. If the cookies are too sweet, you reduce the sugar. If they aren't sweet enough, you add more sugar. You continue to tweak the recipe until you find the perfect balance of ingredients. SGD works similarly. It starts with a set of initial model parameters and makes minor adjustments based on how well the model is doing on a training example.

It repeats the process until the model parameters converge to a satisfactory solution.

Now, let's move on to RMSProp and Adam – two more popular optimizers that do a bit more work per step but often converge faster.

RMSProp


Let's dive into RMSProp – an optimization algorithm that was introduced by Geoff Hinton in 2012.

It's an extension of the famous stochastic gradient descent (SGD) algorithm with an added twist.

The key idea behind RMSProp is to scale each weight's gradient by dividing it by the root mean square (RMS) of that weight's recent gradients, i.e., a moving average of the squared gradients. This keeps weights with consistently large gradients from taking overly large steps, while weights with small gradients still make meaningful progress.

The result is a more stable and effective training process.

Here's an intuitive explanation of RMSProp, step-by-step:

  1. Initialize the RMSProp state variable. This state will store the moving average of the squared gradients.

  2. Compute the gradient of the loss function with respect to each weight in the model.

  3. Update the RMSProp state variables by calculating the moving average of the squared gradients.

  4. Scale the gradients by dividing them by the square root of the RMSProp state (i.e., the moving average of the squared gradients), plus a small epsilon for numerical stability.

  5. Update the model parameters using the scaled gradients and a learning rate.

Repeat steps 2-5 for a predetermined number of iterations or until the model reaches convergence.
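
To make those steps concrete, here's a minimal sketch of the RMSProp update rule in plain PyTorch. The stand-in loss function and hyperparameter values are just for illustration:

```python
import torch

lr = 0.01     # learning rate
decay = 0.9   # decay rate of the moving average
eps = 1e-8    # small constant for numerical stability

w = torch.randn(3, 1, requires_grad=True)  # a model weight
state = torch.zeros_like(w)                # step 1: moving average of squared gradients

def loss_fn(w):
    # Stand-in loss; in practice this comes from your model and data.
    return (w ** 2).sum()

for _ in range(100):
    loss = loss_fn(w)
    loss.backward()                                   # step 2: gradient of the loss w.r.t. w

    with torch.no_grad():
        g = w.grad
        state.mul_(decay).add_((1 - decay) * g ** 2)  # step 3: update the moving average
        scaled = g / (state.sqrt() + eps)             # step 4: scale the gradient
        w -= lr * scaled                              # step 5: update the parameter
        w.grad.zero_()
```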

Another way to think of RMSProp is as adding "friction" to the training process.


Imagine that you're pushing a box across a floor - if the floor is very smooth, the box will keep sliding even after you stop pushing. But if you add friction to the floor (for example, by putting a rug down), the box will eventually stop. In the same way, RMSProp adds "friction" to the training process by decaying the running average of past squared gradients.

Some use cases where RMSProp may be a good choice include:

  • If you are training a model with many parameters and are experiencing issues with the model diverging or oscillating during training, RMSProp can help stabilize the training process by adapting each weight's step size to the scale of its recent gradients.

  • If the learning rate is hard to tune, RMSProp can be effective because it scales the gradients, making convergence less sensitive to the exact learning rate you pick.

  • If you train a model on a noisy or irregularly-shaped loss function, RMSProp's moving average smooths out short-term fluctuations in the gradients and highlights longer-term trends, which helps mitigate the effects of noise and lets the model converge more quickly.

When using RMSProp, it's also important to keep an eye on several key hyperparameters such as the learning rate, batch size, decay rate, and epsilon. Adjusting these can help fine-tune the optimization process and achieve the best results.
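
In PyTorch, these hyperparameters map directly onto the arguments of `torch.optim.RMSprop`. The model below is just a placeholder:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model

optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=0.01,     # learning rate
    alpha=0.99,  # decay rate of the squared-gradient moving average
    eps=1e-8,    # numerical stability term
)
```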

That's it for RMSProp – another powerful optimization algorithm to add to your toolbox!

Adam


Let's dive into Adam – an optimization algorithm introduced by Diederik Kingma and Jimmy Ba in their paper "Adam: A Method for Stochastic Optimization," presented at ICLR in 2015.

It's a stochastic gradient descent variant that adapts the learning rate for each parameter based on estimates of the first and second moments of the gradients.

Adam stands for **Adaptive Moment Estimation**. It maintains exponential moving averages of the gradients and the squared gradients, which it uses to scale each parameter's updates. This allows the effective step size to be adjusted on the fly based on the model's current state rather than staying fixed, which can improve the speed and stability of the optimization process.

Here's an intuitive explanation of Adam, step-by-step:

  1. Initialize the model weights with some starting values.

  2. For each training iteration:

    1. Calculate the gradient of the loss function with respect to the model weights.

    2. Calculate the exponential moving average of the gradients and the exponential moving average of the squared gradients.

    3. Use these moving averages to adjust the learning rate for the current iteration.

    4. Update the model weights using the adjusted learning rate and the gradients from the first sub-step.

  3. Repeat step 2 until the model has converged or the maximum number of iterations has been reached. (A short code sketch of these steps follows.)
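
Here's a minimal sketch of that update in plain PyTorch, including the bias-corrected moving averages. The stand-in loss and hyperparameter values are illustrative:

```python
import torch

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

w = torch.randn(3, 1, requires_grad=True)  # a model weight (step 1)
m = torch.zeros_like(w)                    # moving average of gradients (first moment)
v = torch.zeros_like(w)                    # moving average of squared gradients (second moment)

def loss_fn(w):
    # Stand-in loss; in practice this comes from your model and data.
    return (w ** 2).sum()

for t in range(1, 101):
    loss = loss_fn(w)
    loss.backward()                               # gradient of the loss w.r.t. w

    with torch.no_grad():
        g = w.grad
        m.mul_(beta1).add_((1 - beta1) * g)       # update first-moment average
        v.mul_(beta2).add_((1 - beta2) * g ** 2)  # update second-moment average
        m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (v_hat.sqrt() + eps)    # per-parameter adjusted update
        w.grad.zero_()
```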

Adam is like a GPS for your car that helps you find the optimal gas pedal setting (learning rate) by continuously tracking the car's speed and the curvature of the road (the gradient of the loss function with respect to the model weights). It then adjusts the gas pedal to help you find the fastest and smoothest path to your destination.

Here are a few heuristics or use cases for selecting Adam as your optimizer:

  • When you want a fast and efficient optimization algorithm: Adam's per-step overhead is modest (it stores just two extra values per parameter), and it often reaches a good solution in fewer iterations, making it an efficient choice for training deep learning models.

  • When you have noisy or sparse gradients: Adam is well-suited to these settings because its moving averages smooth out gradient noise and give parameters with infrequent or small gradients relatively larger updates.

  • When you want to try a "plug-and-play" optimization algorithm: Adam's default hyperparameters work well across many problems, so it requires relatively little tuning and is a good choice when you want to get a model training quickly.

It's worth noting that Adam isn't always the best choice, and it's always worth experimenting to see if you can get better results with a different optimizer.

When using Adam, it's also important to keep an eye on several key hyperparameters such as the learning rate, epsilon, batch size, Beta1, and Beta2.

Adjusting these factors can help fine-tune the optimization process and achieve the best results.
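
In PyTorch, these hyperparameters map directly onto the arguments of `torch.optim.Adam` (the model below is just a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,            # learning rate
    betas=(0.9, 0.999),  # Beta1 and Beta2
    eps=1e-8,            # numerical stability term
)
```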

See them in action

Here's a link to a Kaggle notebook where you'll see the impact of optimizers on an image classification task.

Note: I'll use a PyTorch-based training library called SuperGradients.

It's an open-source project that has a robust model zoo of pretrained SOTA models for all the major computer vision tasks, reduces your development time by abstracting away a lot of boilerplate code, and has some awesome training tricks that you can use right out of the box.

Check out the repo and give the project a star.
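
If you'd rather experiment outside the notebook first, here's a minimal plain-PyTorch sketch (not SuperGradients-specific) that shows how swapping optimizers is essentially a one-line change. The tiny model and random data stand in for a real image-classification setup:

```python
import torch
import torch.nn as nn

def train(optimizer_name: str, steps: int = 200) -> float:
    torch.manual_seed(0)
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    X = torch.randn(256, 1, 28, 28)   # stand-in "images"
    y = torch.randint(0, 10, (256,))  # stand-in labels
    loss_fn = nn.CrossEntropyLoss()

    # Swapping optimizers is the only part that changes.
    if optimizer_name == "sgd":
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    elif optimizer_name == "rmsprop":
        optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
    else:
        optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
    return loss.item()

for name in ["sgd", "rmsprop", "adam"]:
    print(f"{name}: final training loss = {train(name):.4f}")
```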