In this article, we’ll walk through building a convolutional neural network (CNN) to classify images without relying on pre-trained models. There are a number of popular pre-trained models (e.g. Inception, VGG16, ResNet50) out there that are helpful for overcoming sampling deficiencies; they have already been trained on many images and can recognize a variety of features. These models typically have complex architectures that are necessary when deciphering the difference between hundreds or thousands of classes. The complexity that offers predictive capacity for a variety of objects can be a hindrance for simpler tasks, as the pre-trained model can overfit the data. Additionally, the architecture can be difficult for a beginner to conceptualize. Luckily, Keras makes building custom CNNs relatively painless. If you are unfamiliar with convolutional neural networks, I recommend starting with Dan Becker’s micro course here.
We’ll use a publicly available dataset called CelebA (available here) so you can recreate the model on your own. We’ll be predicting gender based on headshots of celebrities; the images have already been labeled by their assumed binary gender.
First, organize the images into train, validation and test folders. Each should contain male and female sub-directories.
You can determine the sample sizes on your own, but I used 32,000 images for training, 8,000 for validation, and 1,000 for testing.
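The folder layout above can be sketched with a few lines of Python; the root folder name “celeba_data” is an assumption, so adjust it to your own setup:

```python
from pathlib import Path

# Create the train/validation/test layout, each with male and female
# sub-directories; 'celeba_data' is an assumed root folder name.
base = Path('celeba_data')
for split in ('train', 'validation', 'test'):
    for label in ('male', 'female'):
        (base / split / label).mkdir(parents=True, exist_ok=True)
```

Copy (or symlink) your images into the matching sub-directories afterwards.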
We’re going to build our custom model with Keras layers, so we’ll import the following dependencies. The code for this exercise can be found here.
We’ll start by building the neural network by stacking sequential layers on top of each other. Remember, the purpose is to reduce the dimensionality of the image and identify patterns related to each class. In the code below, we’ll start building a sequential model called “my_model”. The first convolutional block includes a convolutional layer “Conv2D” and a “MaxPooling2D” layer. The convolutional layer uses sixteen 3 by 3 pixel filters that are applied to each part of the image, returning 16 arrays of activation values called feature maps that indicate where certain features are located in the image. The max pooling layer reduces the dimensionality of the feature maps by converting each 2 by 2 pixel grid across the image to one pixel that represents the maximum activation value in that grid. In essence, we want to create feature maps the same dimension as the input image, then resize by 50%.
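A minimal sketch of that first block follows. The input size of 192 by 224 pixels is an assumption (CelebA images would be resized to it), and “same” padding keeps the feature maps the same size as the input before pooling halves them:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D

my_model = Sequential()
# Sixteen 3x3 filters; 'same' padding keeps the feature maps the same
# dimensions as the input image (192x224 is an assumed resize target)
my_model.add(Conv2D(16, (3, 3), activation='relu', padding='same',
                    input_shape=(192, 224, 3)))
# 2x2 max pooling halves each spatial dimension (resize by 50%)
my_model.add(MaxPooling2D(pool_size=(2, 2)))
```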
Now we want to stack additional convolutional layers that allow the model to identify more fine-scale patterns. You can imagine that the difference between typical male and female headshots involves basic attributes like hair length that are fairly easy to identify, but also more nuanced attributes like facial features. To accommodate more fine-scale feature extraction, we’ll stack three more convolutional blocks, each with a higher number of filters than the previous block.
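Putting the full feature-extraction stack together might look like the sketch below (the first block is repeated so the snippet runs on its own). The exact filter progression of 32, 64, and 128 is an assumption consistent with each block having more filters than the previous one:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D

my_model = Sequential()
# First convolutional block from before (input size is an assumption)
my_model.add(Conv2D(16, (3, 3), activation='relu', padding='same',
                    input_shape=(192, 224, 3)))
my_model.add(MaxPooling2D(pool_size=(2, 2)))
# Three more blocks, doubling the filter count each time
for filters in (32, 64, 128):
    my_model.add(Conv2D(filters, (3, 3), activation='relu', padding='same'))
    my_model.add(MaxPooling2D(pool_size=(2, 2)))
```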
So now our model has 128 feature maps of 12 by 14 pixels (one for each filter in the last convolutional layer). To make predictions, we need to reduce the feature maps to a vector using global average pooling (GAP). GAP takes the average activation value in each feature map and returns a one-dimensional tensor. The GAP layer concludes the feature extraction part of the model. We’ll add a dense layer with 64 nodes after the GAP layer, but prior to the layer that makes predictions. This extra fully connected layer allows for greater complexity in the relationships between the features extracted by the convolutional blocks and the predictions. We’ll also add a batch normalization layer, which ensures that activation values from the previous dense layer are on the same scale for each batch by transforming them to Z-scores. Briefly, we want activation values in each layer to be on the same scale to reduce instability in the fitting process that can result from very large values cascading through the network. Lastly, we’ll add the dense layer that makes predictions for the two classes (male or female). Because the outcome is binary, we’ll use a sigmoid activation function instead of the softmax function that many are familiar with.
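The classification head can be sketched on its own as below; the input shape matches the 128 feature maps of 12 by 14 pixels coming out of the last convolutional block. In the full model these layers would simply be added to “my_model”:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (GlobalAveragePooling2D, Dense,
                                     BatchNormalization)

head = Sequential([
    # Average each 12x14 feature map down to one value -> 128-value vector
    GlobalAveragePooling2D(input_shape=(12, 14, 128)),
    # Extra fully connected layer for more complex feature/label relationships
    Dense(64, activation='relu'),
    # Rescale activations to Z-scores within each batch
    BatchNormalization(),
    # Single sigmoid node: probability of one class in the binary outcome
    Dense(1, activation='sigmoid'),
])
```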
Before running the model, take a look at the architecture with “my_model.summary()”.
Two useful callbacks in Keras are “EarlyStopping” and “ModelCheckpoint”, which allow the best model to be saved to a directory automatically during the training process.
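A sketch of the two callbacks follows; the output filename, the monitored metric, and the patience value are assumptions you can tune:

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop training once validation loss stops improving for 5 epochs
    # (the patience value is an assumption)
    EarlyStopping(monitor='val_loss', patience=5),
    # Keep only the best weights seen so far on disk (assumed filename)
    ModelCheckpoint('best_model.h5', monitor='val_loss',
                    save_best_only=True),
]
```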
Next, we’ll compile the model. We’ll use the Adam optimizer because it adapts the learning rate downward over time, an advantage when estimating a large number of weights. Because we’re making binary predictions, we’ll use binary cross-entropy for our loss function.
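The compile step is one line; shown here on a minimal stand-in model so the snippet runs by itself, though in the article’s workflow it would be called on “my_model”:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Stand-in model just to demonstrate the compile call
model = Sequential([Dense(1, activation='sigmoid', input_shape=(8,))])

# Adam optimizer + binary cross-entropy for a binary outcome
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```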
Lastly, we’ll set up the data generators, a Keras utility that draws random batches of images from the directories specified for the training and validation sets. In fitting the model, we’ll use an arbitrary 30 epochs and see how our model performs.
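A sketch of the generators is below. The directory layout is the one created earlier (the empty folders are created here so the snippet runs on its own), and the rescale factor, target size, and batch size are assumptions:

```python
from pathlib import Path
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Assumed folder layout from earlier; created (empty) so this sketch runs
base = Path('celeba_data')
for split in ('train', 'validation'):
    for label in ('male', 'female'):
        (base / split / label).mkdir(parents=True, exist_ok=True)

# Rescale pixel values to [0, 1]; class_mode='binary' matches the
# male/female sub-directories
train_gen = ImageDataGenerator(rescale=1. / 255).flow_from_directory(
    str(base / 'train'), target_size=(192, 224), batch_size=32,
    class_mode='binary')
valid_gen = ImageDataGenerator(rescale=1. / 255).flow_from_directory(
    str(base / 'validation'), target_size=(192, 224), batch_size=32,
    class_mode='binary')

# With real images in place, training for 30 epochs would look like:
# history = my_model.fit(train_gen, epochs=30, validation_data=valid_gen)
```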
After training, we can plot the model performance for each epoch. Validation accuracy peaked at 95%, not bad!
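A minimal plotting helper might look like the following; “history” is assumed to be the object returned by fit() in the training step:

```python
import matplotlib.pyplot as plt

def plot_accuracy(history):
    """Plot training vs. validation accuracy per epoch."""
    ax = plt.gca()
    ax.plot(history.history['accuracy'], label='train')
    ax.plot(history.history['val_accuracy'], label='validation')
    ax.set_xlabel('epoch')
    ax.set_ylabel('accuracy')
    ax.legend()
    return ax
```

Calling plot_accuracy(history) followed by plt.show() displays the figure.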
Lastly, we’ll use the model to make predictions on a 1,000 image test set. The code for making predictions on the test set can be found here. The model predicted gender on the test set with 95% accuracy.
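Because the output layer is a single sigmoid node, turning predictions into class labels is just thresholding at 0.5. The probabilities and labels below are toy values for illustration, not real model output, and the 1 = male mapping is an assumption:

```python
import numpy as np

# Hypothetical sigmoid outputs from my_model.predict() on test images
probs = np.array([0.91, 0.12, 0.66, 0.38])
# Threshold at 0.5: 1 = male, 0 = female (assumed label mapping)
preds = (probs > 0.5).astype(int)
# Hypothetical ground-truth labels for the same images
true = np.array([1, 0, 1, 1])
# Fraction of correct predictions
accuracy = (preds == true).mean()
```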
By building a model layer by layer in Keras, we can customize the architecture to fit the task at hand. It is often useful to try a number of different architectures to see which exhibits superior performance. Any constructive criticism or feedback on this approach is welcome.