In 2012, AlexNet took first place at the ImageNet Large Scale Visual Recognition Challenge, marking the first time a convolutional neural network had won the image classification competition. What made the achievement even more significant was the margin: AlexNet's top-5 error rate was about 15%, compared with roughly 26% for the second-place entry. In the years that followed, convolutional neural networks were rapidly integrated into computer vision projects to solve image classification, localization, object detection, segmentation, and other problems with state-of-the-art accuracy. CNNs became the most widely used approach across computer vision and can be applied to almost anything involving images and video streams. Self-driving cars, security systems, anomaly detectors, medical assistants, and smart traffic regulation systems are just a few examples of where neural networks can be applied. Machine learning with computer vision is an exciting field: computer vision engineers are in high demand, and some major media outlets even predict that the field will keep growing for at least 20 years.
This roadmap shows where to start learning computer vision and which topics deserve the most attention.
1. Create your own classification model with Keras
Have you ever played with Lego bricks? If so, you will have no trouble using Keras, because this framework makes designing deep learning models incredibly simple, even simpler than putting Lego blocks together to build a castle. Of course, the model architecture depends on the complexity of the problem, the amount of data, and other parameters, but with Keras there’s no need to code convolutions or to compute chains of derivatives for backpropagation by hand. Just compose your network layer by layer and feed it data.
Once you grasp the main idea of training neural networks, the next things to learn might include how to pick hyperparameters such as learning rate and batch size, which activation function to choose, what Nesterov momentum is, and how the learning rate can be tuned. To this end, you can create a multi-class classification model trained on the Iris dataset, then apply convolutional neural networks to train a handwritten digit recognizer on the MNIST dataset. Don’t be discouraged if something goes wrong: training neural networks often fails on the first attempts.
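To make Nesterov momentum less abstract, here is a minimal pure-Python sketch of the update rule on a toy one-dimensional problem. The function name `nesterov_sgd`, the toy objective, and the default hyperparameter values are assumptions chosen for illustration, not something a framework prescribes. In Keras, the same idea is exposed through the `nesterov=True` option of the SGD optimizer.

```python
def nesterov_sgd(grad_fn, x0, learning_rate=0.1, momentum=0.9, steps=200):
    """Minimize a 1-D function with gradient descent plus Nesterov momentum."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        # Look ahead: evaluate the gradient where the momentum is about to carry us,
        # not at the current position (this is what distinguishes Nesterov momentum).
        lookahead = x + momentum * velocity
        velocity = momentum * velocity - learning_rate * grad_fn(lookahead)
        x += velocity
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3); its minimum is at x = 3.
x_min = nesterov_sgd(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

Try changing `learning_rate` and `momentum` to see how quickly, or whether, the result approaches 3. That is the same kind of experiment you will be running on real networks.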
If everything above seems boring, feel free to create your own dataset. For example, take a few hundred photos of your pets and train your own pet classification model based on the VGG-16, Inception, or ResNet architectures, or simply jump to the next step.
2. Create a Convolutional Neural Network from scratch with NumPy
The goal here is to get your hands dirty by coding a convolutional neural network without deep learning frameworks. This experience is really important, because debugging deep learning models without understanding what is inside them is like playing Russian roulette with your model. That’s why it is crucial to understand how convolutions work and what backpropagation is, and generally to develop a deeper understanding of deep learning. Try to code manually all of the techniques that you set as parameters in the previous step, and you will appreciate the difference between the high-level Keras framework and low-level programming with NumPy.
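As a starting point, a single-channel convolution and the gradient of its kernel might be sketched in NumPy as follows. This is a deliberately naive version under simplifying assumptions: one 2D image, one 2D kernel, no padding, stride 1, and a toy loss equal to the sum of the outputs. The function names are made up for this example.

```python
import numpy as np


def conv2d_forward(image, kernel):
    """Valid (no padding), stride-1 cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output pixel is the sum of an image patch weighted by the kernel.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def conv2d_backward_kernel(image, grad_out, kernel_shape):
    """Gradient of the loss w.r.t. the kernel, given the gradient w.r.t. the output."""
    kh, kw = kernel_shape
    grad_kernel = np.zeros(kernel_shape)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            # Chain rule: every output pixel contributes its input patch,
            # scaled by the gradient flowing into that pixel.
            grad_kernel += grad_out[i, j] * image[i:i + kh, j:j + kw]
    return grad_kernel


rng = np.random.default_rng(0)
image = rng.normal(size=(5, 5))
kernel = rng.normal(size=(3, 3))

out = conv2d_forward(image, kernel)
# Toy loss: the sum of all outputs, so d(loss)/d(out) is an array of ones.
grad_kernel = conv2d_backward_kernel(image, np.ones_like(out), kernel.shape)
print(out.shape, grad_kernel.shape)
```

A useful habit when writing backpropagation by hand is to compare each analytic gradient with a numerical one (nudge a weight slightly and measure how the loss changes) before trusting it.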
3. Train an object detection model with TensorFlow Object Detection API
Once you’ve filled in the gaps regarding image classification, we can go deeper into computer vision. Now we are going to train a Faster R-CNN object detection model based on ResNet, using the TensorFlow Object Detection API on our own dataset of 3 to 10 classes, though the number of classes can be different. Training an object detector is an interesting task. To start, just clone the TensorFlow models repository from GitHub and follow the installation instructions.
Now it is time to create a dataset (or at least download an existing one). To start creating a dataset, download a few hundred images for each desired category and annotate all of them manually. Fortunately, there is no shortage of tools to make this process simpler, though labeling images can still take up to 80% of the time spent on the project.
Aside from labeling images, creating the dataset also requires splitting it into training and validation subsets and generating, with a script, the *.record files that serve as input for each subset. This may sound a little complicated, but there are plenty of step-by-step tutorials to help you through it, and the TF models documentation is really helpful.
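The splitting step can be as simple as the sketch below, which shuffles a list of files reproducibly and sets aside a fraction for validation. The file names, the 20% validation fraction, and the `split_dataset` helper are illustrative assumptions; generating the *.record files themselves is not shown here.

```python
import random


def split_dataset(items, validation_fraction=0.2, seed=42):
    """Shuffle items reproducibly and split them into training and validation lists."""
    shuffled = list(items)
    # A dedicated Random instance keeps the split repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical file names standing in for your annotated images.
image_files = [f"image_{i:03d}.jpg" for i in range(100)]
train_files, val_files = split_dataset(image_files)
print(len(train_files), len(val_files))
```

Fixing the seed means the same images land in the validation subset every time, which keeps evaluation results comparable between training runs.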
We are not going to train the object detection model from scratch, because even a tiny model with randomly initialized weights requires a couple of days of training on a GPU. Instead, we will download a pre-trained Faster R-CNN ResNet-101 model from the model zoo and fine-tune it on our data.
If your machine is powerful enough, it is worth running an evaluation in conjunction with training and launching TensorBoard to visualize the training process.
4. Learn how to work with OpenCV
OpenCV is considered a universal tool for computer vision problems. It includes many algorithms for image and video processing. Its source code is written in C++, which makes it run very fast, and it also has a Python API that makes OpenCV handy and easy to use.
A good first challenge for understanding OpenCV better is to detect objects in a video using the model from the previous step. If that was easy, try digging into the OpenCV documentation, where you will find interesting algorithms such as the YOLO v3 object detection model trained on the COCO dataset. Or use a camera to build a solution that checks whether there is any food in the fridge.
As soon as you finish creating your project, pay attention to its performance. Don’t forget about refactoring, and consider applying threading. The application should run faster with multiple threads, though some frames may be lost.
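One common way to apply threading, and to see why frames get lost, is to read frames in one thread and process them in another, keeping only the newest frame when processing falls behind. The sketch below uses only the standard library. The frames are just integers and the sleeps are made-up timings standing in for a real camera read (for example `cv2.VideoCapture.read()`) and a real detector.

```python
import queue
import threading
import time


def capture(frame_queue, n_frames):
    """Producer: stands in for a camera loop that reads frames one by one."""
    for frame_id in range(n_frames):
        time.sleep(0.001)  # simulated interval between camera frames
        try:
            frame_queue.put_nowait(frame_id)
        except queue.Full:
            # The consumer is behind: drop the stale frame and keep the newest one.
            try:
                frame_queue.get_nowait()
            except queue.Empty:
                pass  # the consumer picked it up in the meantime
            frame_queue.put_nowait(frame_id)
    frame_queue.put(None)  # sentinel: no more frames


def process(frame_queue, processed):
    """Consumer: stands in for slow per-frame work such as object detection."""
    while True:
        frame_id = frame_queue.get()
        if frame_id is None:
            break
        time.sleep(0.005)  # simulated inference time, slower than the camera
        processed.append(frame_id)


n_frames = 50
frame_queue = queue.Queue(maxsize=1)
processed = []
producer = threading.Thread(target=capture, args=(frame_queue, n_frames))
consumer = threading.Thread(target=process, args=(frame_queue, processed))
producer.start()
consumer.start()
producer.join()
consumer.join()
print(f"processed {len(processed)} of {n_frames} frames")
```

Because the queue holds a single frame, the processing thread always works on a recent image instead of an ever-growing backlog. The price is that the frames in between are skipped.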