Let’s say you run a factory producing rings and want to tell the two models you produce apart, maybe for sorting or quality control:
People are typically pretty good at this, but maybe this is just not what you want them to do, for whatever reason. There are options, though: if you are able to check those rings in a pretty controlled environment, classic image recognition is good at this as well. You check for a specific low-level feature that tells one category from another. In this example it could simply be: detect a ring and check its diameter:
Detecting a circle can be done using the so-called Hough Transformation. Having detected the circles of the rings, we can calculate the diameter, which is quite distinct and actually allows us to tell one type of ring from the other. Libraries like OpenCV provide implementations for such basic routines, including the Hough Transformation.
How to spot the fine Austin Squirrel – Classic might not always be enough
Now let’s have a look at those two pals, lovely aren’t they? You may notice that they have developed some sort of natural camouflage and their fur looks a lot like the trees they are hiding in.
Good for the squirrel, bad for us when we want to recognize animals like them. Looking for low-level features like certain colors or patterns will hardly be successful. Instead, we will have to use more high-level patterns that must cover the wide range of squirrels occurring in their natural habitat. It turns out that machine learning using a sequence of specific neural network layers is just the thing to do that.
Finding out which architecture best suits our specific image recognition task is something we can leave to the academic world. Instead, it is helpful to know what kinds of pre-defined architectures exist, how to choose the right one, and how to train them on our examples. TensorFlow with its Keras API has all those architectures pre-defined and even pre-trained, ready for us to train further or to adapt and fine-tune for our specific applications.
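As a sketch of what that adaptation can look like (not a full training pipeline), here is one of Keras' pre-defined architectures, MobileNetV2, with a new classification head on top. In a real application you would pass weights="imagenet" to start from pre-trained weights and then train on your own labelled squirrel photos:

```python
import tensorflow as tf

# Pre-defined architecture from Keras; weights=None avoids the download here,
# weights="imagenet" would give us the pre-trained backbone instead.
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, weights=None)
base.trainable = False  # freeze the backbone, only train the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # squirrel vs. no squirrel
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels) would then train on our examples
```

Freezing the backbone and training only the small head is what makes fine-tuning feasible even with a modest number of example images.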
How to make sure we look for the right things?
The literature about image recognition is full of anecdotes of things like tanks being recognized by the snow or the blue skies they were photographed against instead of the tank itself. The issue described here is called “overfitting”. Overfitting occurs when a machine learning model learns all kinds of features of the examples it is trained on, but does not concentrate on the relevant ones. It is thus not general enough to recognize similar objects that were not in the training set.
Using tools like Alibi Explain, we can segment the image into parts called superpixels. These can then be combined into so-called anchors and sent through the network until the network is confident it sees the same thing as in the complete image. This way we can check what the network considers essential in the image:
In the example above this looks reasonable. The anchor is pretty much the same as what we would need to recognize a cat as well. In the examples of the tanks, such a procedure would rather see the background and not the object. This way we would know that the network has not been trained properly and will not generalize well to anything it has not seen before.
And the future?
The classic techniques and machine learning described here are both well established and work well in practice. They can be seen as the past and the present of image recognition. But what comes next? There are a couple of techniques that certainly look promising, but they are not yet readily available or not quite as mature.
The most promising approach for the future is called Vision Transformer (ViT). The idea is to phrase image recognition as a language problem using the successful transformer architecture. Images are split into patches of sub-images and transformed into tokens, which are then passed into a fairly standard transformer architecture.
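The patch-to-token step can be sketched in a few lines of NumPy; the image and patch sizes below (224×224 with 16×16 patches) are those of the original ViT paper, used here purely for illustration:

```python
import numpy as np

# A 224x224 RGB image, split into 16x16 patches, one flattened token each
image = np.random.rand(224, 224, 3)
P = 16  # patch size

# Reshape into a 14x14 grid of 16x16x3 patches...
patches = image.reshape(224 // P, P, 224 // P, P, 3).swapaxes(1, 2)

# ...and flatten each patch into a token vector
tokens = patches.reshape(-1, P * P * 3)
print(tokens.shape)  # (196, 768): 196 tokens of dimension 768
```

These 196 tokens then play the same role for the transformer as word tokens do in a language model.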
Another approach is to train on images along with their textual descriptions. After training, people can write descriptions for images they want generated. Sticking with our previous example of squirrels, this is my experiment letting the state-of-the-art model DALL-E 2 generate some very special squirrels for us using the description “A chubby green squirrel on the moon”:
My Workshop at ODSC West 2022 in San Francisco
On the level described here, everything might sound straightforward and simple. However, when you actually want to employ those techniques, quite a few challenges come up, even if you are already familiar with machine learning.
Those challenges range from finding out whether a classic or a machine learning approach is a good fit for your problem, to choosing the right ML architecture, to making sure your model actually does what it claims to do.
My technical, hands-on workshop “Image Recognition with OpenCV and TensorFlow” covers those topics. It will be held in-person at ODSC West 2022 in San Francisco.
Link to workshop and more details: https://odsc.com/speakers/image-recognition-with-opencv-and-tensorflow/
Oliver Zeigermann is the head of artificial intelligence at German consulting company OPEN KNOWLEDGE (https://www.openknowledge.de/). He has been developing software with different approaches and programming languages for more than 3 decades. In the past decade, he has been focusing on Machine Learning and its interactions with humans.