
Choosing the right activation function in a neural network

Activation functions are one of the many choices you must make to get the best performance out of your neural network. In this article I’m going to assume you understand the basics of how a neural network works, and will cover specifically the activation step and the many different ways you can go about it. During the feedforward process, each neuron takes the sum of the values of the neurons on the previous layer, each multiplied by the weight of its connection. For example:


n5 = (n1 * w1) + (n2 * w2) + (n3 * w3) + (n4 * w4)

n6 = (n1 * w5) + (n2 * w6) + (n3 * w7) + (n4 * w8)

n7 = (n1 * w9) + (n2 * w10) + (n3 * w11) + (n4 * w12)
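
The three sums above can be written as one small helper. This is a minimal sketch in Python; the input values and weights are made-up numbers chosen purely for illustration, and `weighted_sum` is a hypothetical helper name.

```python
def weighted_sum(inputs, weights):
    """Sum of each previous-layer value multiplied by its connecting weight."""
    return sum(n * w for n, w in zip(inputs, weights))

# Assumed example values for n1..n4 (not taken from the article).
inputs = [0.5, 0.1, 0.9, 0.3]

# One row of assumed weights per neuron on the next layer.
weights = [
    [0.2, 0.8, 0.4, 0.6],  # w1..w4, feeding n5
    [0.7, 0.1, 0.3, 0.9],  # w5..w8, feeding n6
    [0.5, 0.5, 0.2, 0.1],  # w9..w12, feeding n7
]

n5, n6, n7 = (weighted_sum(inputs, row) for row in weights)
print(n5, n6, n7)
```

Note that these raw sums are not yet activated; they are the values an activation function would then receive.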


The original input data can be very diverse and perhaps out of proportion, so each neuron’s value needs to be scaled into a manageable range. Before feeding forward any further, n5, n6 and n7 must be activated. In simple terms, there is a series of functions you could use that act as a linear or non-linear threshold on the values arriving at a neuron (such as n5, n6, and n7).


A() is the activation function, which is often said to “squash” its input into a more conforming and proportional value (depending on your choice of function). The output is usually a decimal value between 0 and 1. However, the detail lies in how it squashes the input, and in which exact function should be used to do this.


The step function is the simplest. It sets a static threshold, usually 0.5 (but it could be 0), and outputs a 1 or 0 based on whether the input value is greater than or less than the threshold. Bear in mind that this assumes the inputs and weights have been scaled to lie roughly between 0 and 1 (or -1 and 1); a weighted sum of several such values can still fall outside that range.


def step(x, threshold=0.5):
    return 1 if x > threshold else 0  # 1 above the threshold, otherwise 0


This is by nature a very binary approach, and should be used when the problem can be described as binary classification. An example of this could be teaching a model the XOR (exclusive OR) function, where the output is 1 only when exactly one input is 1:


0,1 = 1

1,0 = 1

0,0 = 0

1,1 = 0


The model would have two input neurons, a hidden layer of approximately four neurons, and an output layer of one neuron. On each layer, a step function is all that is required for the activation, as the problem is binary.
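
To make this concrete, here is a minimal sketch of a tiny step-activated network that reproduces the truth table above. It rests on several assumptions: the weights are hand-picked rather than learned, it uses two hidden neurons instead of the roughly four suggested above, and `tiny_network` is a hypothetical name.

```python
def step(x, threshold=0.5):
    """Output 1 if the input exceeds the threshold, otherwise 0."""
    return 1 if x > threshold else 0

def tiny_network(x1, x2):
    # Hidden neuron 1 fires when at least one input is 1.
    h1 = step(1.0 * x1 + 1.0 * x2)
    # Hidden neuron 2 fires only when both inputs are 1 (0.4 + 0.4 > 0.5).
    h2 = step(0.4 * x1 + 0.4 * x2)
    # The output fires when h1 is on and h2 is off.
    return step(1.0 * h1 - 1.0 * h2)

for a, b in [(0, 1), (1, 0), (0, 0), (1, 1)]:
    print(f"{a},{b} = {tiny_network(a, b)}")
```

Because the step function has no useful gradient, weights like these cannot be found by ordinary back propagation; the sketch only shows that step activations are expressive enough for the problem.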


The most commonly used general-purpose activation function is the sigmoid function. On a graph it is a smooth S-shaped curve (blue in the original figure), in contrast to the abrupt jump of the step function (orange):


No matter how high or low the input value, it will get squashed to a proportional value between 0 and 1. It is often treated as a way of converting a value to a probability, which comes to reflect a neuron’s weight or confidence. This introduces nonlinearity to a model, allowing it to pick up on patterns that a purely linear model would miss. By default, you can use the sigmoid function for any problem, and expect some results.


The output can never truly be 1, as that is the upper horizontal asymptote. Likewise with 0, the outputs will always tend towards it without ever reaching it. Of course, in a program, there will be a point where outputs are rounded off.


Here are some example inputs and outputs, so you can see exactly what is going on:
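
A minimal Python sketch that prints a handful of sigmoid values. The sample inputs are arbitrary choices for illustration; the function uses the standard definition S(x) = 1 / (1 + e^-x).

```python
import math

def sigmoid(x):
    """S(x) = 1 / (1 + e^(-x)); the result lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Arbitrary sample inputs, from strongly negative to strongly positive.
for x in [-10, -2, -0.5, 0, 0.5, 2, 10]:
    print(f"S({x}) = {sigmoid(x):.5f}")
```

Large negative inputs land close to 0, large positive inputs close to 1, and an input of 0 sits exactly in the middle at 0.5.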


Here, S() is the sigmoid function. When back propagating, to work out how much each individual weight contributed to the error, you need to step back through the Sigmoid function using its derivative:
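
For reference, the standard derivative of the sigmoid is S'(x) = S(x) * (1 - S(x)). Below is a minimal sketch in Python; the function names are illustrative, and the finite-difference comparison is added only as a sanity check.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """S'(x) = S(x) * (1 - S(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare against a central finite-difference estimate as a sanity check.
x, h = 0.8, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(sigmoid_derivative(x), numeric)
```

The derivative peaks at 0.25 when x = 0 and shrinks towards 0 as the input moves away from zero in either direction.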

The Tanh function is very similar to the sigmoid function, certainly shape-wise. However, its range is wider: instead of returning values between 0 and 1, it returns values between -1 and 1. This stresses the differences between observations, and is essentially much more specific. Therefore, it is appropriate for more complex problems where classifications are separated by finer margins. This can lead to overfitting if your data is relatively simple. The equation for Tanh is very similar to that of the Sigmoid.


The derivative of the Tanh function is:
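
For reference, tanh(x) = (e^x - e^-x) / (e^x + e^-x), and its standard derivative is 1 - tanh(x)^2. The sketch below, with illustrative function names, also shows how closely tanh is related to the sigmoid, using the identity tanh(x) = 2 * S(2x) - 1.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)); the result lies between -1 and 1."""
    return math.tanh(x)

def tanh_derivative(x):
    """tanh'(x) = 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# tanh is a rescaled and shifted sigmoid: tanh(x) = 2 * S(2x) - 1.
for x in [-2, -0.5, 0, 0.5, 2]:
    print(x, tanh(x), 2 * sigmoid(2 * x) - 1, tanh_derivative(x))
```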

The Rectified Linear Unit (ReLU) is one of the most popular and successful activation functions used in deep learning. This may seem surprising at first, as so far the strongly nonlinear functions seem to work better. The benefit of the rectifier actually comes later, during back propagation. There is a common rule of thumb that more layers on a neural network should lead to more success; however, this causes a famous problem with many nonlinear activation functions, such as Sigmoid and Tanh. The problem is known as the vanishing gradient problem, and it can ruin the great opportunity that deep learning (having many layers) offers.


If we have a look at one horizontal slice of a small neural network, perhaps with only one hidden layer, the vanishing gradient problem will not be too much of an issue:

As you can see, on each neuron, S() gets called again. What this really means is S(S(x)): you are essentially squashing the value again and again. Please bear in mind I have ignored the process of multiplying by weights and summing with other neurons for simplicity. The more layers, the worse it gets. For now, going from 0.68997 to 0.665961 may be fine, but then imagine this:



…You would end up with a value whose meaning and integrity have vanished when back propagated with the derivative of the Sigmoid function:
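
The effect can be sketched numerically. The example below is a simplification under stated assumptions: like the discussion above, it ignores weights and sums, the starting input of 0.8 is an assumption chosen because S(0.8) is approximately 0.68997, and `chain` is a hypothetical helper name.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def chain(x, layers):
    """Apply the sigmoid `layers` times. Return the final value and the
    product of the local derivatives (the chain-rule gradient factor)."""
    gradient = 1.0
    for _ in range(layers):
        gradient *= sigmoid_derivative(x)
        x = sigmoid(x)
    return x, gradient

for layers in [1, 2, 5, 10, 20]:
    value, gradient = chain(0.8, layers)
    print(f"{layers:2d} layers: value = {value:.6f}, gradient factor = {gradient:.3e}")
```

Since the sigmoid derivative never exceeds 0.25, the gradient factor shrinks by at least a factor of four per layer, which is why the signal reaching the early layers becomes negligible.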


The advantage of the Rectifier is it does not squash values, as it uses a very simple, static method:


R(x) = max(0,x)
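
In Python, the rectifier and its gradient can be sketched as follows. The function names are illustrative, and returning a gradient of 0 at exactly x = 0 is a common convention rather than something the article specifies.

```python
def relu(x):
    """R(x) = max(0, x): negatives become 0, positives pass through unchanged."""
    return max(0.0, x)

def relu_derivative(x):
    """Gradient is 1 for positive inputs and 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0

for x in [-3.0, -0.5, 0.0, 0.5, 3.0, 1000.0]:
    print(x, relu(x), relu_derivative(x))
```

Unlike the sigmoid, the gradient for positive inputs is always 1, so multiplying it across many layers does not shrink it.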


It literally just maps any negative value to zero, while keeping all positive values as they are. This is why the Rectifier is used in more complex neural networks, such as deep convolutional networks: gradients for positive values pass back unchanged, so depth is far less of a limitation. However, the Rectifier does lose the advantage of shrinking values and preventing overflow or blow-up issues. In other words, it can struggle with very big values, as it does not attempt to squash them.


Another problem with the Rectifier is that in some more extreme cases, it can kill off a neuron. Imagine that after many back propagations, a particular weight ended up adjusting to a very large negative value. In turn, this weight would multiply the value of the previous neuron, and constantly produce a negative number as the input to the next neuron. As a result, R(x) would output zero every time, which is known as a dying neuron (notice that, because there is technically a chance of recovery, it is not a dead neuron). Because of this, there are refined versions of ReLU, such as Parametric and Leaky Rectified Linear Units (PReLU and LReLU), which don’t just map any negative value to 0, but instead scale it down by a small slope (green in the original figure).
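
A minimal sketch of the leaky variant in Python. The default slope of 0.01 is a commonly used value and an assumption here, not something the article states; the function names are illustrative.

```python
def leaky_relu(x, slope=0.01):
    """Positive inputs pass through; negative inputs are scaled by a small fixed slope."""
    return x if x > 0 else slope * x

def leaky_relu_derivative(x, slope=0.01):
    """The gradient is never exactly zero, so a neuron with negative input can still learn."""
    return 1.0 if x > 0 else slope

def prelu(x, learned_slope):
    """PReLU has the same form, but the slope is a parameter learned during
    training; here it is simply passed in as a number."""
    return leaky_relu(x, slope=learned_slope)

for x in [-100.0, -2.0, 0.5, 3.0]:
    print(x, leaky_relu(x), leaky_relu_derivative(x), prelu(x, 0.25))
```

Because negative inputs still produce a small output and a small gradient, a weight that has drifted strongly negative can be corrected by later updates instead of leaving the neuron stuck at zero.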



Caspar Wylie, ODSC


My name is Caspar Wylie, and I have been passionately computer programming for as long as I can remember. I am currently a teenager, 17, and have taught myself to write code with initial help from an employee at Google in Mountain View, California, who truly motivated me. I program every day and am always putting new ideas into perspective. I try to keep a good balance between jobs and personal projects in order to advance my research and understanding. My interest in computers started with very basic electronic engineering when I was only 6, before I moved on to software development at the age of about 8. Since then, I have experimented with many different areas of computing, from web security to computer vision.