The ResNet Revolution: How One Paper Improved Deep Learning

Photo by JJ Ying on Unsplash

The secrets to this deep learning paper's success: how it took on two of the field's most daunting challenges

In December 2015, a paper was published that rocked the deep learning world.

This paper is one of the most influential in deep learning, cited over 110,000 times. Its title: Deep Residual Learning for Image Recognition (the ResNet paper).

This paper showed the deep learning community that you can build increasingly deep network architectures that perform at least as well as their shallower counterparts. When AlexNet came out in 2012, it cemented the belief that more layers lead to better results, and VGGNet, GoogLeNet, and others seemed to prove it.

This set the deep learning community on a quest to go deeper.

Researchers soon found that training deep networks is not as simple as stacking on more layers. Most importantly, training error would increase as layers were added to an already deep network.

This was due to two problems:

1) Vanishing/exploding gradients

2) The degradation problem

Vanishing/exploding gradients

The vanishing/exploding gradients problem comes from the chain rule used in backpropagation. The gradient of the loss with respect to a weight in an early layer is a product of many per-layer terms. If you multiply a lot of values that are less than one, the product gets smaller and smaller, so by the time the error gradients reach the layers early in the network, they are close to zero.

This results in smaller and smaller updates to earlier layers (not much learning happening).

The inverse problem is the exploding gradient problem: if many of the multiplied terms are greater than one, the error gradients grow exponentially instead, and the weights in the early layers receive huge, unstable updates.
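As a toy back-of-the-envelope illustration (the 0.9 and 1.1 factors and the 50-layer depth are made-up numbers, chosen only to show the scale of the effect):

```python
# Toy illustration: the backpropagated gradient is (roughly) a product of
# per-layer factors. Repeatedly multiplying factors below or above 1.0
# drives the product toward zero or toward very large values.
factor_small, factor_large = 0.9, 1.1
grad_small, grad_large = 1.0, 1.0

for layer in range(50):  # pretend the network is 50 layers deep
    grad_small *= factor_small
    grad_large *= factor_large

print(f"50 factors of 0.9 -> {grad_small:.6f}")  # ~0.005: vanishing
print(f"50 factors of 1.1 -> {grad_large:.0f}")  # ~117: exploding
```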

Why do the early layers suffer the most?

The cost function is computed at the network's output, far away from the early layers. The gradient for an early-layer parameter has to travel back through every layer in between, picking up another multiplicative factor at each step.

That long chain of factors is what shrinks or blows up the signal and makes the early layers so difficult to train.

The degradation problem

Adding more and more layers to these deep models leads to higher training error: the accuracy of the network degrades even though its capacity has grown.

The degradation problem is unexpected because it's not caused by overfitting. Researchers found that as networks got deeper, the training loss would decrease at first but then shoot back up as more layers were added. This is counterintuitive because you'd expect training error to decrease, converge, and plateau as the number of layers in your network increases.

Let’s imagine that you had a shallow network that was performing well.

If you take a "shallow" network and just stack more layers on top to create a deeper network, the performance of the deeper network should be at least as good as the shallow one. Why? Because, in theory, the deeper network can represent everything the shallow network does; the shallow network is effectively a subset of the deeper one. But this doesn't happen in practice!

You could even construct that deeper network by hand: keep the shallow model and set the newly stacked layers to identity mappings, and it would compute exactly the same function. Yet when deeper networks are trained in practice, the solver fails to find a solution that good, and the training error gets worse. Deeper networks lead to higher training errors!
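To make that argument concrete, here is a tiny PyTorch sketch (the layer sizes are arbitrary, chosen purely for illustration): appending identity layers to a shallow model yields a deeper model that computes exactly the same function.

```python
import torch
import torch.nn as nn

# An arbitrary "shallow" network (sizes chosen only for illustration).
shallow = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# A "deeper" network built from the same layers plus extra identity layers.
deeper = nn.Sequential(*shallow, nn.Identity(), nn.Identity())

x = torch.randn(2, 8)
# The deeper model computes exactly the same outputs as the shallow one,
# so in principle its training error cannot be worse.
assert torch.equal(shallow(x), deeper(x))
```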

Both of these issues — the vanishing/exploding gradients and degradation problems — threatened to halt the progress of deep neural networks, until the ResNet paper came out.

The ResNet paper introduced a novel solution to these two pesky problems that plague architects of deep neural networks.

The skip connection

Image Source: Sachin Joglekar

Skip connections, which are housed in residual blocks, allow you to take the activation value from an earlier layer and pass it to a deeper layer in a network. Skip connections enable deep networks to learn the identity function. Learning the identity function allows a deeper layer to perform as well as an earlier layer, or at the very least it won’t perform any worse. The result is a smoother gradient flow, making sure important features are preserved in the training process.

The invention of the skip connection has given us the ability to build deeper and deeper networks while avoiding the problem of vanishing/exploding gradients and degradation.

Here’s how it works…

Demonstrating information flow in a plain network. Source: EasyLearn.AI

Instead of the output of the previous layer, X, being passed directly to the next block, a copy of X is set aside while X also goes through the residual block, which processes it with a 3x3 convolution followed by BatchNorm and ReLU to yield a matrix Z.

Then X and Z are added together, element by element, to produce the output that feeds the next layer/block.

Demonstrating information flow in a residual network. Source: EasyLearn.AI

Doing this helps make sure that any added layers in a neural network are useful for learning. In the worst case, the residual block outputs all zeros, and since X + Z = X when Z is the zero matrix, the block simply passes X through unchanged.
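To make the picture concrete, here is a minimal PyTorch sketch of the block described above. It uses a single 3x3 convolution with BatchNorm and ReLU as the residual branch, as in the text; the blocks in the actual ResNet architectures stack two or three convolutions and apply the final ReLU after the addition, but the skip connection itself works the same way.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: Z = ReLU(BatchNorm(Conv3x3(X))), output = X + Z."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.relu(self.bn(self.conv(x)))  # residual branch yields Z
        return x + z                          # skip connection: add X back element-wise

block = ResidualBlock(channels=64)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)  # torch.Size([1, 64, 32, 32])
```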

ResNet in action!

Now it’s time to see ResNet in action.

You could train ResNet from scratch on ImageNet and search for the best training parameters yourself. But why do that when you can use something pre-trained? SuperGradients gives you a ResNet model with good training parameters that you can use with minimal configuration!

This tutorial will show you how to do image classification with SuperGradients and MiniPlaces. SuperGradients is a PyTorch-based training library with models for classification, detection, and segmentation tasks.
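As a rough sketch of what loading that pre-trained model looks like (the exact calls and arguments should be taken from the linked notebook; the "resnet18" model name and num_classes=100 for MiniPlaces' 100 scene categories are assumptions here):

```python
# Sketch only: check the linked notebook for the exact setup.
from super_gradients.training import models

# ResNet-18 with ImageNet-pretrained weights and a fresh 100-class head
# (MiniPlaces has 100 scene categories -- treated as an assumption here).
model = models.get("resnet18", num_classes=100, pretrained_weights="imagenet")
model.eval()
```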

You can follow along here and open up this notebook on Google Colab to get hands-on: