Understanding Gradient descent is important because it’s the most commonly used optimization method deployed in machine learning and deep learning algorithms. It’s used to train a machine learning model and is based on a convex function.
Through an iterative process, gradient descent refines a set of parameters through use of partial differential equations, or PDEs. It does this to minimize a given cost function to its local minimum. Gradient descent was invented by French mathematician Louis Augustin Cauchy in 1847.
[Related Article: A Stochastic Gradient Descent Implementation in Clojure]
Most machine learning and deep learning algorithms involve some sort of optimization. Optimization refers to the process of either minimizing or maximizing some function by altering its parameters.
With gradient descent, you start with a cost function (also known as a loss or error function) based on a set of parameters. The goal is to find the parameter values that minimize the cost function. The process starts by guessing some initial parameter values. Then you iteratively change the parameter values in such a way so as to reduce the cost function. Hopefully, the process ends with a minimum.
The process of changing the parameter values involves differential Calculus, specifically calculating the “derivative” of the cost function. The derivative gives the slope of the function at a specific point. In other words, it specifies how to scale a small change in the input to obtain the corresponding change in the output. The derivative is therefore useful for minimizing the cost function because it tells us how to change the parameters in order to make a small improvement in finding the function’s minimum.
The commonly used analogy is hiking down a hill from an initial starting point, while choosing a direction to advance using small steps along the way toward a minimum point. The gradient descent process uses the derivatives of the cost function to follow the function downhill to a minimum. The figure below illustrates the step-by-step gradient descent process.
The learning rate is a positive scalar value that determines the size of each step in the gradient descent process. If the learning rate is too small, the gradient descent process can be slow. Whereas if the learning rate is too large, gradient descent can overshoot the minimum and may fail to converge, or even diverge. Gradient descent can converge to a local minimum even with a fixed learning rate. As we approach a local minimum, gradient descent will automatically take smaller steps so there is no need to decrease the learning rate over time.
Credit: Stanford CS229 Course Notes. Trajectory for gradient descent is like climbing down into a valley
3 Types of Gradient Descent
There are three primary types of gradient descent used in modern machine learning and deep learning algorithms. The main reason for these variations is computational efficiency. A data set may have millions or even billions of data points, and calculating the gradient over the entire data set can be computationally expensive.
Let’s review the different types here:
Batch Gradient Descent
Batch Gradient Descent is the most straightforward type. It calculates the error for each example within the training set. After it evaluates all training examples, it updates the model parameters. This process is often referred to as a training epoch. Advantages of batch gradient descent are that it’s computationally efficient and produces a stable error gradient and a stable convergence. One disadvantage is that the stable error gradient can sometimes result in a state of convergence that isn’t the best the model can achieve. It also requires that the entire training set resides in memory and is available to the algorithm.
Stochastic Gradient Descent
Stochastic Gradient Descent updates the parameters according to the gradient of the error with respect to a single training example. This is unlike Batch Gradient Descent, which updates the parameters after all training examples have been evaluated. This can make Stochastic Gradient Descent faster than Batch Gradient Descent depending on the problem. One advantage is that the frequent updates provide a detailed rate of improvement. A disadvantage is that the frequent updates are more computationally expensive than Batch Gradient Descent. The frequency of the updates also can result in noisy gradients, and may cause the error rate to fluctuate instead of slowly decrease.
Mini Batch Gradient Descent
Mini Batch Gradient Descent is an often-preferred method since it uses a combination of Stochastic Gradient Descent and Batch Gradient Descent. It simply separates the training set into small batches and performs an update for each of these batches. It thus creates a balance between the efficiency of Batch Gradient Descent and the robustness of Stochastic Gradient Descent. Common numbers of examples per batch range between 30 and 500. But like for any other machine learning technique, there is no well-defined rule because the optimal number can vary for different problems. Mini Batch Gradient Descent is commonly used for deep learning problems.
[Related Article: How Product Managers Learn About AI Meeting Peak Effectiveness]
This article should give you the basic motivation for the gradient descent process in machine learning. Moving forward, in order to understand the mathematical foundations of gradient descent, I strongly recommend the Stanford CS229: Machine learning course notes by Andrew Ng, and Ron Dror.