Convolutional Neural Networks (CNNs) leverage spatial information, and they are therefore well suited for classifying images. These networks use an ad hoc architecture inspired by biological data taken from physiological experiments performed on the visual cortex. Our vision is based on multiple cortex levels, each one recognizing more and more structured information. First, we see single pixels, then we recognize simple geometric forms, and finally more sophisticated elements such as objects, faces, human bodies, animals, and so on.
This article is an excerpt from the book, Deep Learning with TensorFlow 2 and Keras – Second Edition by Antonio Gulli, Amita Kapoor, and Sujit Pal. This book teaches deep learning techniques alongside TensorFlow and Keras. You’ll learn how to write deep learning applications in the most powerful, popular, and scalable machine learning stack available. In this article, we’ll look at the ways in which CNN architecture can be utilized when applied to the area of image processing, and the interesting results that can be generated.
Composing CNNs for complex tasks
The basic CNN architecture can be composed and extended in various ways to solve a variety of more complex tasks. In this article, we will look at the computer vision tasks in the following diagram and show how they can be solved by composing CNNs into larger and more complex architectures:
Classification and localization
In the classification and localization task, you have to report not only the class of the object found in the image, but also the coordinates of the bounding box where the object appears. This type of task assumes that there is only one instance of the object in an image.
This can be achieved by attaching a “regression head” in addition to the “classification head” in a typical classification network. Recall that in a classification network, the final output of convolution and pooling operations, called the feature map, is fed into a fully connected network that produces a vector of class probabilities. This fully connected network is called the classification head, and it is tuned using a categorical loss function (Lc) such as categorical cross entropy.
Similarly, a regression head is another fully connected network that takes the feature map and produces a vector (x, y, w, h) representing the top-left x and y coordinates, width, and height of the bounding box. It is tuned using a continuous loss function (Lr) such as mean squared error. The entire network is tuned using a linear combination of the two losses, that is:
L = αLc + (1 − α)Lr
Here α is a hyperparameter that can take a value between 0 and 1. Unless the value is determined by some domain knowledge about the problem, it can be set to 0.5.
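As a minimal sketch of how the two head losses could be combined, assume the linear combination has the common form α·Lc + (1 − α)·Lr, where α is the weighting hyperparameter. The function names and the sample head outputs below are hypothetical and chosen only for illustration:

```python
import math

def categorical_cross_entropy(true_class, probs):
    """Lc: negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])

def mean_squared_error(true_box, pred_box):
    """Lr: mean squared error over the (x, y, w, h) box values."""
    return sum((t - p) ** 2 for t, p in zip(true_box, pred_box)) / len(true_box)

def combined_loss(true_class, probs, true_box, pred_box, alpha=0.5):
    """Assumed form of the total loss: alpha * Lc + (1 - alpha) * Lr."""
    lc = categorical_cross_entropy(true_class, probs)
    lr = mean_squared_error(true_box, pred_box)
    return alpha * lc + (1.0 - alpha) * lr

# Hypothetical outputs of the two heads for a single image.
probs = [0.1, 0.7, 0.2]              # classification head: class probabilities
pred_box = [0.20, 0.30, 0.50, 0.40]  # regression head: (x, y, w, h)
true_box = [0.25, 0.30, 0.45, 0.40]  # ground-truth box

loss = combined_loss(1, probs, true_box, pred_box)
print(loss)
```

Setting alpha closer to 1 makes training favor the class prediction, while a value closer to 0 favors the box coordinates.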
The following figure shows a typical classification and localization network architecture. As you can see, the only difference with respect to a typical CNN classification network is the additional regression head on the top right:
Semantic segmentation
Another class of problem that builds on the basic classification idea is “semantic segmentation.” Here the aim is to classify every single pixel in the image as belonging to a single class.
An initial method of implementation could be to build a classifier network for each pixel, where the input is a small neighborhood around each pixel. In practice, this approach is not very performant, so an improvement over this implementation might be to run the image through convolutions that will increase the feature depth, while keeping the image width and height constant. Each pixel then has a feature vector that can be sent through a fully connected network that predicts the class of the pixel. However, in practice, this is also quite expensive, and it is not normally used.
A third approach is to use a CNN encoder-decoder network, where the encoder decreases the width and height of the image but increases its depth (number of features), while the decoder uses transposed convolution operations to increase its size and decrease its depth. Transposed convolution (or upsampling) is the process of going in the opposite direction of a normal convolution. The input to this network is the image and the output is the segmentation map. A popular implementation of this encoder-decoder architecture is the U-Net (a good implementation is available at: https://github.com/jakeret/tf_unet), originally developed for biomedical image segmentation, which has additional skip-connections between corresponding layers of the encoder and decoder. The U-Net architecture is shown in the following figure:
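To make the shrinking and growing of the image concrete, here is a small sketch that traces tensor shapes through a hypothetical encoder-decoder. It uses the standard output-size formulas for a strided convolution and for a transposed convolution without output padding. The layer counts, kernel sizes, strides, and depths are assumptions picked for illustration, not those of U-Net:

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a regular convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def transposed_conv_out(size, kernel, stride, padding):
    """Spatial output size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def trace_shapes(size, depth, encoder_depths, decoder_depths):
    """Return the (height, width, depth) shape after every layer."""
    shapes = [(size, size, depth)]
    for d in encoder_depths:
        # Encoder: stride-2 convolution halves width/height, depth grows.
        size = conv_out(size, kernel=3, stride=2, padding=1)
        shapes.append((size, size, d))
    for d in decoder_depths:
        # Decoder: stride-2 transposed convolution doubles width/height.
        size = transposed_conv_out(size, kernel=2, stride=2, padding=0)
        shapes.append((size, size, d))
    return shapes

# 128x128 RGB input; the last decoder depth is the number of pixel classes (5 here).
shapes = trace_shapes(128, 3, [32, 64, 128], [64, 32, 5])
for shape in shapes:
    print(shape)
```

The final shape has the same width and height as the input, with one channel per class, which is what a segmentation map requires.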
Object detection
The object detection task is similar to the classification and localization task. The big difference is that now there are multiple objects in the image, and for each one we need to find the class and bounding box coordinates. In addition, neither the number of objects nor their size is known in advance. As you can imagine, this is a difficult problem and a fair amount of research has gone into it.
A first approach to the problem might be to create many random crops of the input image and for each crop, apply the classification and localization networks we described earlier. However, such an approach is very wasteful in terms of computing and unlikely to be very successful.
A more practical approach would be to use a tool such as Selective Search (Selective Search for Object Recognition, by Uijlings et al.), which uses traditional computer vision techniques to find areas in the image that might contain objects. These regions are called “Region Proposals,” and the network that classifies them was called the Region-based Convolutional Neural Network, or R-CNN. In the original R-CNN, the regions were resized and fed into a network to yield image vectors:
These vectors were then classified with an SVM-based classifier, and the bounding boxes proposed by the external tool were corrected using a linear regression network over the image vectors. An R-CNN network can be represented conceptually as shown in Figure 5:
The next iteration of the R-CNN network was called the Fast R-CNN. The Fast R-CNN still gets its region proposals from an external tool, but instead of feeding each region proposal through the CNN, the entire image is fed through the CNN and the region proposals are projected onto the resulting feature map. Each region of interest is fed through a Region of Interest (ROI) pooling layer and then to a fully connected network, which produces a feature vector for the ROI.
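Projecting a region proposal onto the feature map amounts to rescaling its coordinates by how much the CNN has downsampled the image. The sketch below assumes a total stride of 16 between image pixels and feature-map cells; that value and the helper name project_box are illustrative assumptions only:

```python
def project_box(box, stride=16):
    """Map an (x, y, w, h) box in image pixels to feature-map cells.

    Returns (x0, y0, x1, y1) with exclusive end indices, rounding outward
    so that the projected region always covers at least one cell.
    """
    x, y, w, h = box
    x0, y0 = x // stride, y // stride
    x1 = -(-(x + w) // stride)  # ceiling division
    y1 = -(-(y + h) // stride)
    return x0, y0, max(x1, x0 + 1), max(y1, y0 + 1)

# A hypothetical region proposal in image coordinates.
roi = project_box((64, 32, 160, 96))
print(roi)
```

Because the CNN runs once on the whole image, every proposal only costs this cheap coordinate mapping plus the pooling step, rather than a full forward pass.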
ROI pooling is a widely used operation in object detection tasks using convolutional neural networks. The ROI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (where H and W are two hyperparameters). The feature vector is then fed into two fully connected networks, one to predict the class of the ROI and the other to correct the bounding box coordinates for the proposal. This is illustrated in Figure 6.
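The following is a minimal, single-channel sketch of ROI max pooling in plain Python. It assumes the region is given in feature-map cells as (x0, y0, x1, y1) with exclusive end indices, and it splits the region into an out_h × out_w grid of bins using floor and ceiling boundaries. Real implementations differ in how they round bin edges, so treat this as an illustration of the idea rather than a reference implementation:

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the cells inside roi into a fixed out_h x out_w grid."""
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Row range covered by the i-th bin (never empty).
        r0 = y0 + (i * roi_h) // out_h
        r1 = y0 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            # Column range covered by the j-th bin.
            c0 = x0 + (j * roi_w) // out_w
            c1 = x0 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# A toy 6x6 feature map whose values increase left-to-right, top-to-bottom.
fm = [[r * 6 + c for c in range(6)] for r in range(6)]

small = roi_max_pool(fm, (1, 1, 5, 6), 2, 2)  # 4 wide, 5 tall region
whole = roi_max_pool(fm, (0, 0, 6, 6), 2, 2)  # the entire map
print(small)
print(whole)
```

Both regions have different sizes, yet each produces a 2 × 2 result, which is what lets the fully connected layers that follow accept regions of any shape.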
The Fast R-CNN is about 25x faster than the R-CNN. The next improvement, called the Faster R-CNN, removes the external region proposal mechanism and replaces it with a trainable component, called the Region Proposal Network (RPN), within the network itself. The output of this network is combined with the feature map and passed through a pipeline similar to that of the Fast R-CNN network, as shown in Figure 7. The Faster R-CNN network is about 10x faster than the Fast R-CNN network, making it approximately 250x faster than an R-CNN network:
Another somewhat different class of object detection networks is Single Shot Detectors (SSD) such as You Only Look Once (YOLO). In these cases, each image is split into a predefined number of parts using a grid. In the case of YOLO, a 7×7 grid is used, resulting in 49 subimages. A predetermined set of crops with different aspect ratios is applied to each subimage. Given B bounding boxes and C object classes, the output for each image is a vector of size (7 * 7 * (5B + C)). Each bounding box has a confidence and coordinates (x, y, w, h), and each grid cell has prediction probabilities for the different objects detected within it.
The YOLO network is a CNN that does this transformation. The final predictions and bounding boxes are found by aggregating the findings from this vector. In YOLO, a single convolutional network predicts the bounding boxes and the related class probabilities. YOLO is faster than the region-based detectors described above, but the algorithm might fail to detect smaller objects.
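The size of the output vector described above can be checked with a few lines of Python. The values B = 2 and C = 20 are illustrative choices, and the memory layout used by cell_slice and split_cell (cells stored row by row, with each cell holding its boxes first and its class probabilities last) is an assumption made for this sketch:

```python
def yolo_output_size(grid=7, boxes_per_cell=2, num_classes=20):
    """Length of the flat output vector: grid * grid * (5B + C)."""
    return grid * grid * (5 * boxes_per_cell + num_classes)

def cell_slice(output, row, col, grid, boxes_per_cell, num_classes):
    """Extract the values that belong to one grid cell (assumed row-major layout)."""
    width = 5 * boxes_per_cell + num_classes
    start = (row * grid + col) * width
    return output[start:start + width]

def split_cell(cell, boxes_per_cell):
    """Split one cell into B (x, y, w, h, confidence) tuples and C class probabilities."""
    boxes = [tuple(cell[5 * b:5 * b + 5]) for b in range(boxes_per_cell)]
    class_probs = list(cell[5 * boxes_per_cell:])
    return boxes, class_probs

size = yolo_output_size()
print(size)

# Dummy output vector, used only to show which positions belong to which field.
output = list(range(size))
boxes, class_probs = split_cell(cell_slice(output, 0, 1, 7, 2, 20), 2)
print(boxes)
print(len(class_probs))
```

Each cell contributes 5 values per box (four coordinates plus a confidence) and one shared set of class probabilities, which is where the 5B + C term comes from.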
Instance segmentation
Instance segmentation is similar to semantic segmentation (the process of associating each pixel of an image with a class label), with a few important distinctions. First, it needs to distinguish between different instances of the same class in an image. Second, it is not required to label every single pixel in the image. In some respects, instance segmentation is also similar to object detection, except that instead of bounding boxes, we want to find a binary mask that covers each object.
This second view leads to the intuition behind the Mask R-CNN network. The Mask R-CNN is a Faster R-CNN with an additional CNN in front of its regression head, which takes as input the bounding box coordinates reported for each ROI and converts them to a binary mask:
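To illustrate what "one binary mask per object" means as a data structure, the sketch below starts from a hypothetical instance-id map in which 0 marks unlabeled pixels and the ids 1 and 2 mark two objects of the same class. It is not Mask R-CNN itself; it only shows the form of the output and how each mask relates to a bounding box:

```python
def instance_masks(instance_map):
    """Return one binary mask per non-zero instance id."""
    ids = sorted({v for row in instance_map for v in row if v != 0})
    return {i: [[1 if v == i else 0 for v in row] for row in instance_map]
            for i in ids}

def mask_to_box(mask):
    """Tightest (x, y, w, h) bounding box around the 1-pixels of a mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return cols[0], rows[0], cols[-1] - cols[0] + 1, rows[-1] - rows[0] + 1

# Hypothetical 3x5 image with two instances; zeros are left unlabeled.
instance_map = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 2, 2],
]

masks = instance_masks(instance_map)
for instance_id, mask in masks.items():
    print(instance_id, mask_to_box(mask), mask)
```

A semantic segmentation of the same image would give both objects the same label, whereas here each one gets its own mask, and the mask for the second object is not a full rectangle the way its bounding box is.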
In April 2019, Google released Mask R-CNN in open source, pre-trained with TPUs (https://colab.research.google.com/github/tensorflow/tpu/blob/master/models/official/mask_rcnn/mask_rcnn_demo.ipynb). I suggest playing with the Colab notebook to see what the results are. In Figure 9 we see an example of image segmentation.
In this section we have covered, at a somewhat high level, various network architectures that are popular in computer vision. Note that all of them are composed of the same basic CNN and fully connected architectures. This composability is one of the most powerful features of deep learning. Hopefully, this has given you some ideas for networks that could be adapted for your own computer vision use cases.
In this article, we explored how CNN architectures are applied to image processing within computer vision and how CNNs can be composed for complex tasks. Build machine and deep learning systems with the newly released TensorFlow 2 and Keras for the lab, production, and mobile devices with Deep Learning with TensorFlow 2 and Keras – Second Edition by Antonio Gulli, Amita Kapoor, and Sujit Pal.
About the Authors
Antonio Gulli is a software executive and business leader with a passion for establishing and managing global technological talent, innovation, and execution. He is an expert in search engines, online services, machine learning, information retrieval, analytics, and cloud computing.
Amita Kapoor is an Associate Professor in the Department of Electronics, SRCASW, University of Delhi and has been actively teaching neural networks and artificial intelligence for the last 20 years. She is an active member of ACM, AAAI, IEEE, and INNS. She has co-authored two books.
Sujit Pal is a technology research director at Elsevier Labs, working on building intelligent systems around research content and metadata. His primary interests are information retrieval, ontologies, natural language processing, machine learning, and distributed processing. He is currently working on image classification.