Exploit Your Hyperparameters: Batch Size and Learning Rate as Regularization

Gradient descent is one of the first concepts many learn when studying machine or deep learning. This optimization algorithm underlies most of machine learning, including backpropagation in neural networks. When learning gradient descent, we learn that learning rate and batch size matter.

Specifically, increasing the learning rate speeds up the learning of your model, yet risks overshooting its minimum loss. Reducing batch size means your model uses fewer samples to calculate the loss in each iteration of learning.
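To make the learning-rate trade-off concrete, here is a minimal sketch of plain gradient descent on a toy one-dimensional loss, f(w) = w**2. The loss, the starting point, and the three learning rates are illustrative assumptions, not values from any real model.

```python
def gradient_descent(lr, steps=20, w0=5.0):
    """Minimize the toy loss f(w) = w**2 and return every visited point."""
    w = w0
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # standard gradient descent update
        path.append(w)
    return path

slow = gradient_descent(lr=0.01)       # small steps: steady but slow progress
fast = gradient_descent(lr=0.4)        # larger steps: reaches the minimum quickly
overshoot = gradient_descent(lr=1.1)   # too large: every step overshoots further

for name, path in [("lr=0.01", slow), ("lr=0.4", fast), ("lr=1.1", overshoot)]:
    print(f"{name}: distance from minimum after 20 steps = {abs(path[-1]):.4g}")
```

On this toy loss, each update multiplies w by (1 - 2 * lr), so any learning rate above 1.0 makes the iterates grow instead of shrink. Real loss surfaces have no such clean threshold, but the same tension between speed and overshooting applies.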

Beyond that, these precious hyperparameters receive little attention. We tune them to minimize our training loss, then turn to “more advanced” regularization approaches to reduce overfitting and improve our models. Is that the right approach?

Reimagining Gradient Descent

Let’s take a step back. Let’s say we want the absolute minimum loss. If we had unlimited computing power, what would we do with our learning rate? We’d probably reduce it to get to the absolute minimum. However, unlike the loss surface of a model like linear regression, the loss surface of a neural network is not convex. Rather than looking like a bowl, it looks like a mountain range:

Photo by Sergey Pesterev on Unsplash

Reducing your learning rate helps you settle deeper into one of those low points, but it will not stop you from dropping into a random sub-optimal hole. This is a local minimum: a point that looks like the lowest point but is not. It may also overfit your training data, meaning the model will not generalize well to the real world.

What we want is to get into one of those smoother, flatter regions. Neural networks that settle in flatter minima tend to generalize better (with added benefits such as greater robustness to attacks on your network).

An intuitive way to think about this is that a narrow hole is likely specific to your training data. Slight changes to your data (i.e. moving in any direction in this space) will result in major changes to the loss, while a flatter terrain will be less sensitive. That is where regularization comes in: techniques to prevent our model from getting stuck in these narrow, deep minima.
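The sensitivity argument above can be sketched with two toy one-dimensional losses that share the same minimum at w = 0 but differ in curvature. Both functions and the size of the shift are hypothetical; the shift stands in for a small mismatch between training data and real-world data.

```python
def sharp_loss(w):
    """Narrow valley: high curvature around the minimum at w = 0."""
    return 50.0 * w ** 2

def flat_loss(w):
    """Broad valley: low curvature around the minimum at w = 0."""
    return 0.5 * w ** 2

shift = 0.3  # hypothetical displacement caused by a change in the data

# How much does the loss rise when we are pushed away from each minimum?
sharp_increase = sharp_loss(0.0 + shift) - sharp_loss(0.0)
flat_increase = flat_loss(0.0 + shift) - flat_loss(0.0)

print(f"loss increase in the narrow minimum: {sharp_increase:.3f}")
print(f"loss increase in the flat minimum:   {flat_increase:.3f}")
```

The same displacement costs far more in the narrow valley than in the broad one, which is the intuition behind preferring flat minima.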

There are many regularization techniques, such as dropout, dataset augmentation, and distillation. However, why add to our toolbox before making the best use of the tools we have? Two tools that are always present, learning rate and batch size, can perform a degree of regularization for us.

Finding That Broad Minimum

As the previous example showed, reducing the learning rate to a very low number makes it easy to get stuck in a local minimum that may be overfit.

By increasing the learning rate, we gain the rarely discussed benefit of allowing our model to escape minima that overfit. The model becomes more likely to skip over narrow local minima and settle in a broader, flatter minimum.
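Here is a minimal sketch of that escape effect on a hand-built one-dimensional loss with two valleys: a narrow, slightly deeper one near w = -1 and a broad one near w = 2. The loss function, the starting point, and both learning rates are assumptions chosen to make the behavior visible; real networks are far higher-dimensional and the effect there is a tendency, not a guarantee.

```python
def loss_and_grad(w):
    """Toy loss: the lower of a narrow valley at w = -1 and a broad one at w = 2."""
    sharp = 50.0 * (w + 1.0) ** 2 - 0.5   # narrow and slightly deeper
    flat = 0.5 * (w - 2.0) ** 2           # broad and shallow
    if sharp < flat:
        return sharp, 100.0 * (w + 1.0)   # gradient of the narrow branch
    return flat, (w - 2.0)                # gradient of the broad branch

def run(lr, steps=500, w0=-1.2):
    """Plain gradient descent starting inside the narrow valley."""
    w = w0
    for _ in range(steps):
        _, grad = loss_and_grad(w)
        w = w - lr * grad
    return w

w_small = run(lr=0.001)  # small steps: stays in the narrow valley
w_large = run(lr=0.05)   # larger steps: bounces out and settles in the broad valley

print(f"lr=0.001 ends at w = {w_small:.3f}")
print(f"lr=0.05  ends at w = {w_large:.3f}")
```

On a quadratic valley with curvature c, gradient descent is only stable when the learning rate is below 2 / c. Here that threshold is 0.02 for the narrow valley and 2.0 for the broad one, so a learning rate of 0.05 is too large to stay in the former but comfortable in the latter.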

We still need to ensure the learning rate is not too large, potentially leveraging techniques like adaptive learning rates. Yet ultimately, we want to ensure a small learning rate is not making our model look good in training and bad everywhere else.

Similarly, reducing the batch size adds more noise to convergence. Smaller batches vary more from one another, so the rate and direction of convergence on the above terrain are more variable. As a result, the model is more likely to find broader local minima. This contrasts with using a large batch size, or even all of the sample data, which results in smooth convergence to a deep local minimum. Hence, a smaller batch size can provide implicit regularization for your model.
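The extra noise from small batches can be measured directly. The sketch below uses a synthetic one-parameter regression problem (the data, the model y_hat = w * x, and the batch sizes are all assumptions for illustration) and compares how much the mini-batch gradient estimate varies from batch to batch.

```python
import random
import statistics

random.seed(0)

# Synthetic 1-D regression data: y = 3x + noise (hypothetical example data).
xs = [random.uniform(-1.0, 1.0) for _ in range(2000)]
ys = [3.0 * x + random.gauss(0.0, 0.5) for x in xs]
data = list(zip(xs, ys))

def minibatch_gradient(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def gradient_spread(w, batch_size, trials=300):
    """Standard deviation of the gradient estimate across random batches."""
    grads = [minibatch_gradient(w, random.sample(data, batch_size))
             for _ in range(trials)]
    return statistics.stdev(grads)

w = 0.0
spread_small = gradient_spread(w, batch_size=8)
spread_large = gradient_spread(w, batch_size=512)

print(f"gradient spread, batch size 8:   {spread_small:.3f}")
print(f"gradient spread, batch size 512: {spread_large:.3f}")
```

The small-batch gradients scatter much more widely around the full-data gradient. That scatter is the noise that makes it harder for the optimizer to settle into a narrow minimum.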


There has been plenty of research into regularization techniques for neural networks. Researchers have even questioned whether such techniques are necessary, since neural networks seem to show implicit regularization. Yet, before applying other regularization steps, we can reimagine the role of learning rate and batch size. Doing so can reduce overfitting to create better, simpler models.

Higher learning rates and lower batch sizes can prevent our models from getting stuck in deep, narrow minima. As a result, they will be more robust to changes between our training data and real-world data, performing better where we need them to.

Article originally posted here. Reposted with permission.

About the Author: David Yastremsky is a system software engineer, focused on simplifying the deployment of deep learning models. Having transitioned to AI from consulting and running an impact-focused start-up, he writes to make cutting-edge research accessible and clear for all. He actively contributes to a number of leading publications, including Towards Data Science and Dev Genius. In his spare time, he sings, plays kickball, and hikes around Seattle.
