State-of-the-art image search technology uses both natural language processing and computer vision. Traditional NLP tasks like classification, named entity recognition, and translation are being combined with advances in computer vision to automate tasks like image captioning and website creation. Exciting as these advances have been, they have been primarily restricted to use by large tech firms and internet giants. To reach useful levels of accuracy, these tasks demand very large datasets and infrastructure that can scale, requirements that have been the primary barrier to entry for smaller firms and startups.
In his talk at ODSC West 2018, Matthew Rubashkin walks through some of the different approaches that have typically been undertaken in the creation of image search engines. The first that he mentions is known as an “end-to-end” model, which takes a single query image and computes similarity scores against the entire dataset at query time. This process is quick and effective for datasets of only a few hundred images. Once the dataset grows beyond that, however, scalability problems rapidly make it impractical.
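The talk does not include code, but the query-time scoring pattern can be illustrated with a minimal sketch: here the "images" are stand-in pixel vectors, the similarity metric is assumed to be cosine similarity, and every image in the collection is scored against the query on each search, which is what makes the approach O(N) per query.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two flattened image vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def end_to_end_search(query, dataset, top_k=3):
    """Score the query against every image at query time.

    This is the O(N)-per-query pattern that works for a few hundred
    images but degrades as the collection grows.
    """
    scores = [(i, cosine_similarity(query, img)) for i, img in enumerate(dataset)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

# Toy "images": random pixel vectors standing in for real data.
rng = np.random.default_rng(0)
dataset = [rng.random(64) for _ in range(200)]
query = dataset[42] + 0.01 * rng.random(64)  # near-duplicate of image 42

top = end_to_end_search(query, dataset)
print(top[0][0])  # the near-duplicate, image 42, should rank first
```

The sort over the full score list is exactly the cost that grows with the dataset: every new image adds another comparison to every query.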
The second option uses Siamese Networks to compute a pairwise similarity score between two images, and uses that score as the baseline for the dataset. This is the approach typically used by larger firms, as it scales to larger datasets. However, it demands significant processing power and is both slow and computationally expensive, making it an unlikely option for all but the largest and richest companies. Another drawback of this model is that it cannot leverage text as a search feature.
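The defining trait of a Siamese network is that both inputs pass through the same encoder with shared weights, and similarity is read off the distance between the two embeddings. The sketch below is a hypothetical toy (a single shared linear layer with ReLU, untrained), not Rubashkin's architecture, but it shows the shared-weights/pairwise-distance structure:

```python
import numpy as np

rng = np.random.default_rng(1)

# One shared weight matrix: both "twin" branches of a Siamese network
# use identical weights, mapping both images into the same space.
W = rng.standard_normal((64, 16)) * 0.1

def encode(x):
    # Shared encoder branch (one linear layer + ReLU, for illustration).
    return np.maximum(W.T @ x, 0.0)

def pairwise_similarity(img_a, img_b):
    """Siamese-style score: encode both images with the SAME weights,
    then compare embeddings by Euclidean distance (negated, so that
    higher means more similar)."""
    za, zb = encode(img_a), encode(img_b)
    return -float(np.linalg.norm(za - zb))

a = rng.random(64)
b = a + 0.01 * rng.random(64)   # near-duplicate of a
c = rng.random(64)              # unrelated image

print(pairwise_similarity(a, b) > pairwise_similarity(a, c))
```

The cost problem described above follows directly from this design: ranking one query against N images means N forward passes through the network, for every query.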
A third approach, as explained by Rubashkin, uses image embeddings to calculate similarity ahead of time, bypassing the time-consuming and imperfect process of creating similarity scores at query time. This approach is flexible, fast, and scalable to multiple architectures. Beyond this, it can not only retrieve images similar to an image input or words similar to an input word, but can also be cross-trained to generate tags for images and to search for images based on text.
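The key shift is that the expensive work moves offline: embeddings for the whole collection are computed once and stored, and a query then reduces to a cheap nearest-neighbor lookup. A minimal sketch, assuming a fixed linear map as a stand-in for a pretrained encoder:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((64, 16))  # stand-in for a pretrained encoder

def embed(x):
    # Project the image and L2-normalize, so dot products are cosine scores.
    z = W.T @ x
    return z / np.linalg.norm(z)

# --- Offline: embed the entire collection ONCE and store the matrix. ---
images = [rng.random(64) for _ in range(500)]
index = np.stack([embed(img) for img in images])   # shape (500, 16)

# --- Online: a query is one matrix-vector product, not N model runs. ---
def search(query, top_k=3):
    scores = index @ embed(query)            # cosine scores vs. everything
    return np.argsort(scores)[::-1][:top_k]  # highest-scoring indices

best = search(images[7] + 0.01 * rng.random(64))[0]
print(best)  # the near-duplicate of image 7 should rank first
```

In production this lookup would typically go through an approximate nearest-neighbor index rather than a full matrix product, but the precompute-once structure is the same.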
This method is not perfect: Rubashkin notes that sorting by embedding places images into specific categories, which can limit search results and lead to inaccuracies. One workaround has been a “semi-supervised” approach, which essentially amounts to biasing the model after the initial training as specific errors become evident. More promising as a large-scale fix seems to be leveraging text, using NLP to emphasize the differences or similarities between images that image-only models deem similar. As it stands, Rubashkin’s model adds a layer that classifies images by type, so that, per his example, the model understands that the difference between “dog” and “airplane” is greater than that between “dog” and “cat.” Without text-based training, the computer cannot discern these degrees of difference.
Rubashkin’s approach opens up possibilities for widely usable image search models, and advances in NLP promise to make hybridized models even more efficient. Watch the full talk below, and make sure to attend ODSC East 2019 for more exciting developments.