Following on from Part I of “Why are Convnets often Better than the Rest?”, we will now look at a weakness of the traditional neural network. As before, I am going to assume you already understand the basics of neural network modeling, including how images are fed into the input layer.
Throughout this article, I will use various abstractions of neural network models to illustrate and support my explanation. If these do not make sense, I recommend you take a look at my article “Getting to Know Neural Networks with Perceptron.”
Fully-Connected Layers as a Main Weakness in Image Classification
When I refer to a traditional neural network, I mean a standard feed forward model in which every layer is fully connected. In image classification, making every layer fully connected is actually the key weakness.
For clarification, this is what fully-connected looks like:
With a fully-connected neuron structure like this, every neuron on a layer receives the whole image, so in effect a single group of neurons processes the image as one region. Why is that problematic? Simply because there is less specialization in observing distinct features; the model only observes the input image as one large feature.
For example, if a dog and a wall are in the same image, a fully-connected architecture has no built-in way to tell them apart and will initially treat them as one defining entity. Many additional data samples are needed before the model starts to show a difference between the dog and the wall.
For an individual neuron on a following layer, the feed forward algorithm could be written as:
where r is the number of neurons on the previous layer, i is the index of the specific neuron being computed, w is the value of the lth weight on the ith neuron (with l running over the r neurons of the previous layer), and a is the value of neuron i.
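As a minimal sketch of that feed forward step in Python: the weighted sum over the r values of the previous layer follows the description above, while the bias term, the sigmoid activation, and the function names are assumptions added for illustration.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1). Assumed activation."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_neuron(prev_values, weights, bias=0.0):
    """Value of one neuron on the following layer.

    prev_values: the r neuron values on the previous layer.
    weights:     the r weights belonging to this neuron, one per previous neuron.
    bias:        assumed extra term; the text above does not mention it.
    """
    if len(prev_values) != len(weights):
        raise ValueError("a fully-connected neuron needs one weight per previous neuron")
    weighted_sum = sum(w * a for w, a in zip(weights, prev_values)) + bias
    return sigmoid(weighted_sum)


def dense_layer(prev_values, weight_rows, biases):
    """Every neuron on the layer sees every value on the previous layer."""
    return [dense_neuron(prev_values, w, b) for w, b in zip(weight_rows, biases)]
```

Note that each neuron needs as many weights as there are neurons on the previous layer, which is where the cost of a fully-connected layer comes from.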
The Alternative to Consider
Let’s take a look at the image below to better understand how the alternative works and why it might be more efficient than a fully-connected neural network.
For clarification purposes, I have separated the example into four images. You will also notice I have ordered the neurons in a square shape in an attempt to demonstrate how the various groups of neurons make up different parts of the image.
The essential idea of this model is that only a portion of the neurons is used to analyze and formulate each portion of the output. This is a major benefit of the model, because with fewer weights to handle, computation time drops drastically.
As you can see above, on the bottom layer of every square model, each neuron connects to a group of neurons on the previous (top) layer. Note that many neurons on the first (top) layer will never connect to some neurons on the second (bottom) layer, which is why this model is not fully connected. As such, each neuron on the second layer is specialized, and responsible for understanding an individual region of the image.
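To make the saving concrete, here is a rough sketch that lists which first-layer neurons each second-layer neuron connects to and compares the number of connections with the fully-connected case. The 4×4 first layer, the 2×2 group size, the one-step shift between groups, and the helper name `local_connections` are all assumptions chosen for illustration.

```python
def local_connections(input_size, field_size):
    """Map each second-layer neuron to the first-layer neurons it connects to.

    Assumes a square first layer and square groups that shift one step at a time.
    """
    out_size = input_size - field_size + 1
    connections = {}
    for row in range(out_size):
        for col in range(out_size):
            connections[(row, col)] = [
                (row + d_row, col + d_col)
                for d_row in range(field_size)
                for d_col in range(field_size)
            ]
    return connections


conns = local_connections(input_size=4, field_size=2)

# Each second-layer neuron only connects to its own small group.
local_count = sum(len(group) for group in conns.values())

# A fully-connected layer would link every first-layer neuron to every second-layer neuron.
full_count = (4 * 4) * len(conns)

print(local_count, full_count)
```

Under these assumptions the locally connected layout has 36 connections, against 144 for the fully-connected one.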
These individual regions are known as local receptive fields, and they can be designed with a couple of different parameters. Firstly, we can experiment with the field’s width and height. In the example above, the local receptive field is 2×2. It is important to tune the size of the field correctly: if the field is too small, the model can overfit by picking up on too many fine-grained details. If the field is too big, you end up with something close to a fully-connected model, which is what we are moving away from.
Secondly, we can manipulate the stride length, which is the distance the local receptive field moves between one position and the next. In the example above, the stride length is one, as each local receptive field moves along by one column (or down by one row). Choosing a stride is essentially a compromise between computation time and accuracy. Think of it this way: the greater the stride length, the fewer fields there are to evaluate, and the less time the model takes to compute. That saves time, but the model might skip over details in the image.
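The trade-off can be sketched with two small helpers. They assume a square layer, no padding around the edges, and fields that must fit entirely inside the layer; the function names are hypothetical.

```python
def field_positions(input_size, field_size, stride):
    """Starting column (or row) of every local receptive field along one axis."""
    return list(range(0, input_size - field_size + 1, stride))


def output_size(input_size, field_size, stride):
    """Number of neurons along one axis of the following layer."""
    return (input_size - field_size) // stride + 1


# Stride of one: fields start at every column, so nothing is skipped.
print(field_positions(5, 2, 1), output_size(5, 2, 1))

# Stride of two: fewer fields to compute, but the fields starting at
# columns 0 and 2 cover columns 0-3 only, so column 4 is never looked at.
print(field_positions(5, 2, 2), output_size(5, 2, 2))
```

With a 5-wide layer and a 2×2 field, moving from a stride of one to a stride of two halves the number of fields per axis from four to two, at the cost of ignoring the last column entirely.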
For an individual neuron on a following layer, the feed forward algorithm can be written as:
This equation has some similarities to the one from earlier. The main difference is that, instead of looking at a layer with one dimension, we use both i and m to reference a specific neuron. Weights are referenced with i and m as well, because each neuron within the field is only ever associated with one weight value.
To access a neuron value on the previous layer, we use both j+i and k+m. This may seem confusing, given that we only use i and m when accessing a weight value. In fact, i and m are local references within a specific local receptive field. By contrast, j and k are global references, in the scope of the entire matrix, that locate the local receptive field. Here is a quick visualization of this explanation:
In this example, j and k reference the dark green neuron, and the lighter green neurons (belonging to the local receptive field) are referenced via i and m. You may wonder why the weights do not use j and k in the feed forward equation. This is intentional; it is another core design feature of a convolutional net. The same weights are shared across every local receptive field. In the next article of this series, we will look at this in more detail.
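The indexing can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact equation: (j, k) is taken to be the top-left corner of the local receptive field, and the bias term, the sigmoid activation, and the function names are added for the example.

```python
import math


def sigmoid(z):
    """Assumed activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def local_neuron(prev_layer, weights, j, k, bias=0.0):
    """Value of the neuron whose local receptive field starts at (j, k).

    prev_layer: 2D list of neuron values on the previous layer.
    weights:    2D list of shared weights, indexed locally by (i, m).
    """
    total = bias
    for i, weight_row in enumerate(weights):
        for m, w in enumerate(weight_row):
            # Weights use the local indices; the previous layer uses global ones.
            total += w * prev_layer[j + i][k + m]
    return sigmoid(total)


def local_layer(prev_layer, weights, stride=1, bias=0.0):
    """Apply the same weights to every local receptive field."""
    field_size = len(weights)
    rows = range(0, len(prev_layer) - field_size + 1, stride)
    cols = range(0, len(prev_layer[0]) - field_size + 1, stride)
    return [[local_neuron(prev_layer, weights, j, k, bias) for k in cols] for j in rows]
```

Because `weights` never depends on j or k, a 2×2 field needs only four weights no matter how large the previous layer is.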