Neural networks are notoriously tricky to optimize. There is no tractable way to compute a globally optimal set of weights, so we’re left fishing around in the dark for acceptable solutions while trying to ensure we don’t overfit the data.
This is a quick overview of the most popular approaches to model regularization with neural networks. While all of these methods are heuristic by nature, they tend to perform well in the wild and have yielded good results in practical experiments.
It’s up to data scientists to test and determine what combination of methods will get the best results. Stirring a few techniques together in the same model can get you a lot further than your vanilla unregularized model.
Dropout
Dropout has been around since a seminal paper introduced the idea in 2014. In a few years it has become one of the most common ways to regularize neural networks.
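The technique is incredibly straightforward: on each training step, drop every node independently with probability p by zeroing its output. A dropped node contributes nothing to the forward pass, so its input and output weights receive no update during backpropagation on that step. At test time all nodes are active. That’s it.
This prevents nodes from becoming too “spiky,” in other words, weighting any given input from the previous layer too heavily. Because no node can count on a particular neighbor being present, the surrounding layers learn to use the outputs of the other nodes, smoothing out the weights across all nodes.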
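To make the mechanics concrete, here is a minimal NumPy sketch of dropout on one layer's activations. It assumes the "inverted dropout" variant, which rescales the surviving activations by 1 / (1 - p) during training so that nothing needs to change at test time; that rescaling, and the function names `dropout_forward` and `dropout_backward`, are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np

def dropout_forward(activations, p_drop, rng, training=True):
    """Apply inverted dropout; p_drop is the probability of dropping a node."""
    if not training or p_drop == 0.0:
        # At test time every node is kept and nothing is rescaled.
        return activations, np.ones_like(activations)
    keep_prob = 1.0 - p_drop
    # Keep each node independently with probability keep_prob, then rescale
    # the survivors so the expected activation is unchanged (assumed variant).
    mask = (rng.random(activations.shape) < keep_prob) / keep_prob
    return activations * mask, mask

def dropout_backward(grad_output, mask):
    # Dropped nodes have a zero mask entry, so no gradient flows through them
    # and their weights receive no update on this step.
    return grad_output * mask

rng = np.random.default_rng(0)
h = np.ones((4, 5))                      # toy activations for one mini-batch
h_drop, mask = dropout_forward(h, p_drop=0.5, rng=rng)
grad = dropout_backward(np.ones_like(h), mask)
```

A fresh mask is drawn on every training step, so a different random subset of nodes sits out each time.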
L1, L2 and Elastic-Net Regularization
Norm regularization is a penalty imposed on the model’s objective function for using weights that are too large. This is done by adding an extra term onto the function: $\frac{1}{2}\lambda w^2$ for the L2 norm, and $\lambda |w|$ for L1. In these expressions, $\lambda$ is a hyperparameter that controls the degree of regularization in the model. As in any classic regularization setup, adding this extra term will induce the model to balance the loss of its output against the magnitude of its weights.
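As a rough illustration, the two penalty terms and their gradients could be written as follows. The sketch assumes the per-weight terms are summed over every weight in the model, and the helper names are hypothetical.

```python
import numpy as np

def l2_penalty(w, lam):
    # 1/2 * lambda * w^2, summed over all weights (summation is assumed)
    return 0.5 * lam * np.sum(w ** 2)

def l1_penalty(w, lam):
    # lambda * |w|, summed over all weights
    return lam * np.sum(np.abs(w))

def l2_penalty_grad(w, lam):
    # The 1/2 cancels the 2 from differentiation: each weight is pulled
    # toward zero in proportion to its own size.
    return lam * w

def l1_penalty_grad(w, lam):
    # Subgradient of |w|: a constant-size pull toward zero (0 at w == 0).
    return lam * np.sign(w)

def regularized_loss(data_loss, w, lam, norm="l2"):
    # The objective the optimizer sees: output loss plus the weight penalty.
    penalty = l2_penalty(w, lam) if norm == "l2" else l1_penalty(w, lam)
    return data_loss + penalty

w = np.array([0.5, -2.0, 0.0])
total_l2 = regularized_loss(1.0, w, lam=0.1, norm="l2")
total_l1 = regularized_loss(1.0, w, lam=0.1, norm="l1")
```

The gradients hint at the difference in behavior: the L2 pull shrinks as a weight gets small, while the L1 pull stays the same size all the way down to zero.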
L2 encourages the model to use all of its inputs without leaning too heavily on any one. L1 tends to drive many weights to exactly zero, so it is generally better if you expect the model to rely on a few inputs and ignore the rest. L2 is more general-purpose and shows up more often as a result.
The third option is the elastic-net regularization penalty, which is just a combination of both penalties: $\lambda\left(\alpha \cdot \frac{1}{2} w^2 + (1 - \alpha) \cdot |w|\right)$, where $\alpha$ controls the balance between the two terms. According to the original paper, elastic-net is most useful when the number of predictors is significantly larger than the number of observations.
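A small sketch of the combined penalty, using the same parameterization as the formula above (α weights the L2 term, 1 - α the L1 term). Summing over all weights and the function name `elastic_net_penalty` are assumptions made for illustration.

```python
import numpy as np

def elastic_net_penalty(w, lam, alpha):
    """lambda * (alpha * 1/2 * w^2 + (1 - alpha) * |w|), summed over weights."""
    l2_term = 0.5 * np.sum(w ** 2)
    l1_term = np.sum(np.abs(w))
    return lam * (alpha * l2_term + (1.0 - alpha) * l1_term)

w = np.array([0.5, -2.0, 0.0])
pure_l2 = elastic_net_penalty(w, lam=0.1, alpha=1.0)   # reduces to the L2 penalty
pure_l1 = elastic_net_penalty(w, lam=0.1, alpha=0.0)   # reduces to the L1 penalty
blended = elastic_net_penalty(w, lam=0.1, alpha=0.5)   # halfway between the two
```

Setting α to 1 or 0 recovers plain L2 or plain L1, so α can be tuned like any other hyperparameter.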
Early Stopping
A third intuitive approach to regularization is early stopping. Training too long can cause overfitting, meaning the local optimum the network settles into may perform well only on its training data.
The solution is simply to train the neural network for less time. One way to implement this is to halt training when the model’s validation error hasn’t measurably improved for x epochs. This technique is called “early stopping” and can be thought of as a close cousin of the pocket algorithm for perceptrons.
This is the basic process for using early stopping:
- Set a counter to 0 and select a patience (an integer value for the number of epochs you’re willing to wait for the model to improve before you end training).
- Train your neural network for one epoch.
- Evaluate its performance against a reserved validation set.
- If the model performs better than its best result so far, record the new best and reset the counter to 0; otherwise, increment the counter.
- If the counter equals your patience, quit training; otherwise, go back to the training step.
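The loop can be sketched in a few lines of framework-agnostic Python. This version compares each epoch against the best validation loss seen so far and resets the counter on improvement, which is one common way to read "hasn't improved for x epochs." The callables `train_one_epoch` and `evaluate` are hypothetical stand-ins for whatever your framework provides, and the `max_epochs` cap is an added safety assumption.

```python
def train_with_early_stopping(train_one_epoch, evaluate, patience, max_epochs=1000):
    """Train until validation loss fails to improve for `patience` epochs."""
    best_loss = float("inf")
    best_epoch = -1
    counter = 0          # epochs since the last improvement
    epochs_run = 0
    for epoch in range(max_epochs):
        train_one_epoch()            # one pass over the training data
        val_loss = evaluate()        # loss on the reserved validation set
        epochs_run = epoch + 1
        if val_loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss = val_loss
            best_epoch = epoch
            counter = 0
        else:
            counter += 1
        if counter >= patience:
            break                    # patience exhausted, stop training
    return best_epoch, best_loss, epochs_run

# Toy run: a canned sequence of validation losses instead of a real model.
simulated_losses = iter([1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.6])
result = train_with_early_stopping(
    train_one_epoch=lambda: None,
    evaluate=lambda: next(simulated_losses),
    patience=3,
)
```

In a real training run you would also save a copy of the weights whenever a new best is recorded, so you can restore the best model rather than the last one.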
These are just some of the most common ways to regularize neural networks. This is an active field of research, and more sophisticated techniques appear regularly. However, leveraging some basic approaches like dropout, norm regularization, and early stopping can dramatically improve performance with minimal effort.
Want to learn more? At ODSC West 2018, Scott Clark will give a talk on state-of-the-art ways to tune neural networks using a technique called multitask optimization. You can learn from Clark and other experts on the cutting edge of data science research Oct. 31 to Nov. 3 in San Francisco.