# Neural Network Optimization

Deep LearningModelingNeural Networksposted by Matthew Stewart April 15, 2020 Matthew Stewart

This article is the third in a series of articles aimed at demystifying neural networks and outlining how to design and implement them. In this article, I will discuss the following concepts related to the optimization of neural networks:

- Challenges with optimization
- Momentum
- Adaptive Learning Rates
- Parameter Initialization
- Batch Normalization

*[Related Article: Image Augmentation for Convolutional Neural Networks]*

All the code that is discussed in this and subsequent tutorials on the topics of (fully connected) neural networks will be accessible through my **Neural Networks** GitHub repository, which can be found at the link below.

**Challenges with Optimization**

When talking about optimization in the context of neural networks, we are discussing **non-convex optimization**.

*Convex optimization* involves a function in which there is only one optimum, corresponding to the global optimum (maximum or minimum). There is no concept of local optima for convex optimization problems, making them relatively easy to solve—these are common introductory topics in undergraduate and graduate optimization classes.

Non-convex optimization involves a function which has multiple optima, only one of which is the global optima. Depending on the loss surface, it can be very difficult to locate the global optima

For a neural network, the curve or surface that we are talking about is the loss surface. Since we are trying to minimize the prediction error of the network, we are interested in finding the global minimum on this loss surface—this is the aim of neural network training.

There are multiple problems associated with this:

**What is a reasonable learning rate to use?**Too small a learning rate takes too long to converge, and too large a learning rate will mean that the network will not converge.**How do we avoid getting stuck in local optima?**One local optimum may be surrounded by a particularly steep loss function, and it may be difficult to ‘escape’ this local optimum.**What if the loss surface morphology changes?**Even if we can find the global minimum, there is no guarantee that it will remain the global minimum indefinitely. A good example of this is when training on a dataset that is not representative of the actual data distribution—when applied to new data, the loss surface will different. This is one reason why trying to make the training and test datasets representative of the total data distribution is of such high importance. Another good example is data that habitually changes in distribution due to its dynamic nature—an example of this would be user preferences for popular music or movies, which changes day-to-day and month-to-month.

Fortunately, there are methods available that provide ways to tackle all of these challenges, thus mitigating their potentially negative ramifications.

**Local Optima**

Previously, local minima were viewed as a major problem in neural network training. Nowadays, researchers have found that when using sufficiently large neural networks, most local minima incur a low cost, and thus it is not particularly important to find the true global minimum—a local minimum with reasonably low error is acceptable.

**Saddle Points**

Recent studies indicate that in high dimensions, saddle points are more likely than local minima. Saddle points are also more problematic than local minima because close to a saddle point the gradient can be very small. Thus, gradient descent will result in negligible updates to the network and hence network training will cease.

An example function that is often used for testing the performance of optimization algorithms on saddle points is the **Rosenbrook function**. The function is described by the formula: *f(x,y) = (a-x)² + b(y-x²)², *which has a global minimum at *(x,y) = (a,a²)*.

This is a non-convex function with a global minimum located within a long and narrow valley. Finding the valley is relatively easy, but it is difficult to converge to the global minimum due to the flat valley—which thus has small gradients so it is difficult for gradient-based optimization procedures to converge.

**Poor Conditioning**

An important problem is the particular form of the error function that represents the learning problem. It has long been noted that the derivatives of the error function are usually *ill-conditioned*. This ill-conditioning is reflected in error landscapes which contain many saddle points and flat areas.

To understand this, we can look at the *Hessian matrix*—a square matrix of second-order partial derivatives of a scalar-valued function. The *Hessian* describes the local curvature of a function of many variables.

The Hessian can be used to determine whether a given stationary point is a saddle point or not. If the Hessian is indefinite at that location, that stationary point is a saddle point. This can also be reasoned in a similar way by looking at the eigenvalues.

Computing and storing the full Hessian matrix takes *O(n²)* memory, which is infeasible for high-dimensional functions such as the loss functions of neural networks. For such situations, truncated-Newton and quasi-Newton algorithms are often used. The latter family of algorithms use approximations to the Hessian; one of the most popular quasi-Newton algorithms is BFGS.

Often for neural networks, the Hessian matrix is poorly conditioned—the output changes rapidly for a small change of input. This is an undesirable property as it means that the optimization process is not particularly stable. In these environments, learning is slow despite the presence of strong gradients because oscillations slow the learning process down.

**Vanishing/Exploding Gradients**

So far we have only discussed the structure of the objective function—in this case the loss function—and its effects on the optimization process. There are additional issues associated with the architecture of the neural network, which is particularly relevant for deep learning applications.

The above structure is an example of a deep neural network with *n* hidden layers. As features at the first layer are propagated through the network, they undergo affine transformations followed by an activation function, described as follows:

The above equations are true for a single layer. We can write the output for an *n*-layer network:

Now there are two possible cases for the above formulation, depending on the magnitude of *a* and *b*.

If the values are greater than 1, for a large value of *n* (a deep neural network), the gradient values will quickly explode as they propagate through the network. Exploding gradients lead to “cliffs” unless gradient clipping is implemented (the gradient is clipped if it exceeds a certain threshold value).

If the values are less than 1, the gradients will quickly tend to zero. If computers were capable of storing infinitely small numbers then this would not be a problem, but computers only store values to a finite number of decimal places. If the gradient value becomes smaller than this value, it will just be recognized as zero.

So what are we to do? We have found that neural networks are doomed to have large numbers of local optima, often containing both sharp and flat valleys which result in the stagnation of learning and unstable learning.

I will now discuss some of the ways we can help to mitigate the issues we have just discussed regarding neural network optimization, starting with momentum.

# Momentum

One problem with stochastic gradient descent (SGD) is the presence of oscillations which result from updates not exploiting curvature information. This results in SGD being slow when there is high curvature.

By taking the average gradient, we can obtain a faster path to optimization. This helps to dampen oscillations because gradients in opposite directions get canceled out.

The name momentum comes from the fact that this is analogous to the notion of linear momentum from physics. An object that has motion (in this case it is the general direction that the optimization algorithm is moving) has some inertia which causes them to tend to move in the direction of motion. Thus, if the optimization algorithm is moving in a general direction, the momentum causes it to ‘resist’ changes in direction, which is what results in the dampening of oscillations for high curvature surfaces.

Momentum is an added term in the objective function, which is a value between 0 and 1 that increases the size of the steps taken towards the minimum by trying to jump from a local minimum. If the momentum term is large then the learning rate should be kept smaller. A large value of momentum also means that the convergence will happen fast. But if both the momentum and learning rate are kept at large values, then you might skip the minimum with a huge step. A small value of momentum cannot reliably avoid local minima, and can also slow down the training of the system. Momentum also helps in smoothing out the variations, if the gradient keeps changing direction. A right value of momentum can be either learned by hit and trial or through cross-validation.

Momentum uses past gradients for updating values, as shown in the formula below. The value *v* associated with momentum is often called the ‘velocity’. More weight is applied to more recent gradients, creating an exponentially decaying average of gradients.

We can see the effects of adding momentum on an optimization algorithm. The first few updates show no real advantage over vanilla SGD—since we have no previous gradients to utilize for our update. As the number of updates increases our momentum kickstarts and allows faster convergence.

Another type of momentum that exists is Nesterov momentum, which we will discuss briefly.

**Nesterov Momentum**

A good discussion of Nesterov momentum is given in Sutskever, Martens et al.”On the importance of initialization and momentum in deep learning” 2013.

The main difference is in classical momentum you first correct your velocity and then make a big step according to that velocity (and then repeat), but in Nesterov momentum, you first make a step into velocity direction and then make a correction to a velocity vector based on a new location (then repeat).

i.e. classical momentum:

```
vW(t+1) = momentum.*Vw(t) - scaling .* gradient_F( W(t) )
W(t+1) = W(t) + vW(t+1)
```

While Nesterov momentum is this:

```
vW(t+1) = momentum.*Vw(t) - scaling .* gradient_F( W(t) + momentum.*vW(t) )
W(t+1) = W(t) + vW(t+1)
```

The difference is subtle but in practice, it can make a huge difference.

This concept can be difficult to comprehend, so below is a visual representation of the difference between the traditional momentum update and Nesterov momentum.

**Adaptive Learning Rate**

Oscillations along vertical direction—Learning must be slower along parameter 2 Use a different learning rate for each parameter?

## AdaGrad

Momentum adds updates to the slope of our error function and speeds up SGD in turn. AdaGrad adapts updates to each individual parameter to perform larger or smaller updates depending on their importance.

One of Adagrad’s main benefits is that it eliminates the need to manually tune the learning rate and results in greater progress along gently sloped directions.

AdaGrad’s main weakness is its accumulation of the squared gradients in the denominator: Since every added term is positive, the accumulated sum keeps growing during training. This, in turn, causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.

## RMSProp

For non-convex problems, AdaGrad can prematurely decrease the learning rate. We can use an exponentially weighted average for gradient accumulation.

## Adam

Adam is a combination of RMSprop and momentum (similarly, Nadam refers to a combination of RMSprop and Nesterov momentum). Adam refers to *adaptive moment estimation, *and it is the most popular optimizer used for neural networks today.

Adam computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients vt like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.

**Parameter Initialization**

In the previous sections, we looked at how best to navigate the loss surface of the neural network objective function in order to converge to the global optimum (or an acceptable good local optimum). Now we will look at how we can manipulate the network itself in order to aid the optimization procedure.

The initialization of network weights is an important and often overlooked characteristic of developing neural networks. There are multiple issues associated with poorly initialized networks that can be detrimental to network performance.

Take for example a network that we initialize with all values of zero. What would happen in this scenario? The network would actually not learn anything at all. Even after a gradient update, all the weights would still be zero, because of the inherent way we are calculating the gradient updates.

Imagine that we implemented this and worked out it was an issue, and then decided to set our network initialization all to the same value of 0.5. What would happen now? The network would actually learn something, but we have prematurely prescribed some form of symmetry between neural units.

In general, it is a good idea to avoid presupposing any form of a neural structure by randomizing weights according to a normal distribution. This is often done in Keras by specifying a random state (which provides randomness but ensures that the measurements are repeatable).

What should be the scale of this initialization? If we choose large values for the weights, this can lead to exploding gradients. On the other hand, small values for weights can lead to vanishing gradients. There is some sweet spot that provides the optimum tradeoff between these two, but it cannot be known *a priori* and must be inferred through trial and error.

## Xavier Initialization

Xavier initialization is a simple heuristic for assigning network weights. With each passing layer, we want the variance to remain the same. This helps us keep the signal from exploding to high values or vanishing to zero. In other words, we need to initialize the weights in such a way that the variance remains the same for both the input and the output.

The weights are drawn from a distribution with zero mean and a specific variance. For a fully-connected layer with *m* inputs:

The value *m* is sometimes called the *fan-in: *the number of incoming neurons (input units in the weight tensor).

It is important to remember that this is a heuristic and therefore has no particular theoretical backing—it has merely been empirically observed to perform well. You can read the original paper here.

**He Normal Initialization**

He normal initialization is essentially the same as Xavier initialization, except that the variance is multiplied by a factor of two.

In this method, the weights are initialized keeping in mind the size of the previous layer which helps in attaining a global minimum of the cost function faster and more efficiently. The weights are still random but differ in range depending on the size of the previous layer of neurons. This provides a controlled initialization hence the faster and more efficient gradient descent.

For ReLU units, it is recommended:

## Bias Initialization

Bias initialization refers to how the biases of the neurons should be initialized. We have already described that weights should be randomly initialized with some form of normal distribution (to break symmetry), but how should we approach the biases?

To paraphrase the Stanford CS231n course: The simplest and a common way of initializing biases is to set them to zero—since the asymmetry breaking is provided by the small random numbers in the weights. For ReLU non-linearities, some people like to use small constant values such as 0.01 for all biases because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient. However, it is not clear if this provides a consistent improvement and it is more common to set biases to zero.

One main concern with bias initialization is to avoid saturation at initialization within hidden units—this can be done, for example in ReLU, by initializing biases to 0.1 instead of zero.

**Pre-Initialization**

One other method of initializing weights is to use pre-initialization. This is common for convolutional networks used for examining images. The technique involves importing the weights of an already trained network (such as VGG16) and using these as the initial weights of the network to be trained.

This technique is only really viable for networks which are to be used on similar data to that which the network was trained on. For example, VGG16 was developed for image analysis, if you are planning to analyze images but have few data samples in your data set, pre-initialization might be a tenable method to utilize. This is the underlying concept behind transfer learning, but the terms pre-initialization and transfer learning are not necessarily synonymous.

# Batch Normalization

Up to this point, we have looked at ways to navigate the loss surface of the neural network using momentum and adaptive learning rates. We have also look at several methods of parameter initialization in order to minimize *a priori* biases within the network. In this section, we will look at how we can manipulate the data itself in order to aid our model optimization.

To do this, we will look at batch normalization and some of the ways it can be implemented to aid the optimization of neural networks.

**Feature Normalization**

Feature normalization is exactly what it says, it involves normalizing features before applying the learning algorithm. This involves rescaling the feature and is generally done during preprocessing.

According to the paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, gradient descent converges much faster with feature scaling than without it.

There are several ways to scale the data. One common method is *min-max normalization*, whereby

The simplest method to scale data is known as *min-max normalization* and involves rescaling the range of features to scale the range in [0, 1] or [−1, 1]. This is done by subtracting each value by the minimum value and then scaling this by the range of values present in the dataset. If the distribution of data is highly skewed this can result in many values clustered in one location. If this occurs it can sometimes be mitigated by taking the logarithm of the feature variable (as this has a tendency to collapse outliers so they have a less profound effect on the distribution).

Another common method is *mean normalization*, this is essentially the same as min-max normalization except the average value is subtracted from each value. This is the least common of the three discussed in this article.

*Feature standardization* makes the values of each feature in the data have zero-mean (when subtracting the mean in the numerator) and unit-variance. This method is widely used for normalization in many machine learning algorithms (typically those that involve distance-based methods). The general method of calculation is to determine the distribution mean and standard deviation for each feature. Next, we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation.

By performing normalization we ameliorate the distortion (such as the elongation of one feature compared to another feature) of the dataset and make it more uniform.

## Internal Covariance Shift

This idea also comes from the previously mentioned paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”.

The authors define *internal covariance shift*:

We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training.

This may be a bit vague, so I will try to unpack this. In neural networks, the output of the first layer feeds into the second layer, the output of the second layer feeds into the third, and so on. When the parameters of a layer change, the distribution of inputs to subsequent layers also changes.

These shifts in input distributions can be problematic for neural networks, as it has a tendency to slow down learning, especially deep neural networks that could have a large number of layers.

It is well established that networks converge faster if the inputs have been whitened (ie zero mean, unit variances) and are uncorrelated and internal covariate shift leads to just the opposite.

Batch normalization is a method intended to mitigate internal covariate shift for neural networks.

## Batch Normalization

Batch normalization is an extension to the idea of feature standardization to other layers of the neural network. If the input layer can benefit from standardization, why not the rest of the network layers?

To increase the stability of a neural network, batch normalization normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation.

Batch normalization allows each layer of a network to learn by itself more independently of other layers.

However, after this shift/scale of activation outputs by some randomly initialized parameters, the weights in the next layer are no longer optimal. SGD ( Stochastic gradient descent) undoes this normalization if it’s a way for it to minimize the loss function.

Consequently, batch normalization adds two trainable parameters to each layer, so the normalized output is multiplied by a “standard deviation” parameter (γ) and add a “mean” parameter (β). In other words, batch normalization lets SGD do the denormalization by changing only these two weights for each activation, instead of losing the stability of the network by changing all the weights.

This procedure is known as the *batch normalization transform*.

To illustrate this visually, we can analyze the below figure. We are looking at the first hidden layer, immediately following the input layer. For each of the *N* mini-batches, we can calculate the mean and standard deviation of the output.

This is subsequently repeated for all subsequent hidden layers. Following this, we can differentiate the joint loss for the *N* mini-batches and then back-propagate through the normalization operations.

Batch normalization reduces overfitting because it has a slight regularization effect. Similar to *dropout*, it adds some noise to each hidden layer’s activations.

During test time, the mean and standard deviations are replaced with running averages collected during training time. This is the same as using the population statistics instead of mini-batch statistics as this ensures that the output deterministically depends on the input.

There are several advantages to using batch normalization:

- Reduces internal covariant shift.
- Reduces the dependence of gradients on the scale of the parameters or their initial values.
- Regularizes the model and reduces the need for dropout, photometric distortions, local response normalization and other regularization techniques.
- Allows use of saturating nonlinearities and higher learning rates.

# Final Comments

*[Related Article: Adversarial Attacks on Deep Neural Networks]*

This concludes the third part of my series of articles about fully connected neural networks. In the next articles, I will provide some in-depth coded examples demonstrating how to perform neural network optimization, as well as more advanced topics for neural networks such as warm restarts, snapshot ensembles, and more.

*Originally Posted Here*