Editor’s note: Abon is a speaker for ODSC West this Fall! Consider attending his talk, “Computer Vision for E-Commerce: Intelligent Analysis and Selection of Product Images at Scale” then.
In retail, product images are critical to delivering a satisfactory customer experience. Images help online shoppers gain confidence in a product and engage with it more, which increases the likelihood of purchase. Hence, from the perspective of an omnichannel retailer like Walmart, images are an integral, vast, and valuable component of its catalog. We are motivated to analyze images for several reasons: to measure the quality of images, to detect and discard offensive pictures, and to select and rank images by their relevance to a product.
Image analysis for a large product catalog such as ours typically goes through the following stages (not necessarily in this order):
Filter images by content or quality – This covers several binary classification problems, each addressing a quality (such as sharpness) or a compliance issue (such as violent or adult content). Some of them, especially the compliance problems, are ill-posed with severe class imbalance. Some of them are better treated as object detection than classification.
Classify images based on content – This stage categorizes the images of a product into several buckets based on their viewpoint or other characteristics, such as lifestyle vs. solo images, product image vs. packaging image, and so on.
Extract content from images – In this step, specific information such as textual attributes is extracted from images wherever possible. An example of this is detection of the drug facts table from the picture of a medicine and extraction of the ingredients. Another example is intelligent extraction of a flat or a textured square region from apparel images that can be used as a thumbnail or swatch on the website.
For curious readers, here are the links to our recent papers that discuss our image analysis pipeline in more detail:
We deal with a number of challenges while building the models and algorithms that make up the above-mentioned pipeline. Let us focus on one of them in this article: shortage of training data. This challenge, far more commonplace than you might think, arises primarily for two reasons:
1. The problem we want to detect manifests rarely, making it difficult to find examples of the “positive” class. A typical example of this would be an offensive image such as a racially inappropriate symbol on a hat. Usually, one such image is found and reported by a customer by accident. Fewer than 0.01% of the products have such an image, and there is no easy way to sift through the catalog to find more examples.
2. The scale of the problem makes image annotation prohibitively expensive. Let us say that we want to classify images into five viewpoints – top, bottom, left, right, and close-up – for a product, and then scale it to 10,000 types of products. This means we have an extreme-scale classification problem with ~50,000 classes to solve. Since different viewpoints are often close to each other in color and shape, we do need a decent number of annotated images for each class. Also, for such a fine-grained task, we should probably use a trained crowd, which is more expensive than a completely anonymous one, to ensure better-quality responses. Even if we ask for only 10 annotated images per class, the cost of annotating half a million images can be too high for many projects in many companies.
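The arithmetic behind these figures can be checked with a short sketch. The viewpoint count, the number of product types, and the images-per-class figure come from the paragraph above; the per-image price (`cost_per_image_cents`) is a purely hypothetical number chosen for illustration, not an actual crowdsourcing rate.

```python
def annotation_budget(viewpoints, product_types, images_per_class, cost_per_image_cents):
    """Return (number of classes, number of images, total cost in cents)."""
    classes = viewpoints * product_types      # one class per (viewpoint, product type)
    images = classes * images_per_class       # annotated images needed overall
    total_cents = images * cost_per_image_cents
    return classes, images, total_cents


# 5 viewpoints, 10,000 product types, 10 images per class (from the text).
# The 10-cents-per-image price is an assumption made only for this example.
classes, images, total_cents = annotation_budget(5, 10_000, 10, cost_per_image_cents=10)
print(f"{classes:,} classes, {images:,} images, ${total_cents / 100:,.2f}")
```

Costs are kept in integer cents to avoid floating-point rounding. Changing `images_per_class` shows how quickly the budget grows for a fine-grained task like this one.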
We often resort to a number of practical strategies – some conventional and some ad hoc or unique – to deal with this challenge.
1. Data augmentation – Standard image augmentation techniques include color-space and geometric transformations. They often cannot produce enough useful data, so problem-specific custom techniques such as superimposition or image synthesis are applied.
2. Few-shot learning – When only a few examples of a class are known in advance, a mix of few-shot learning and a conventional classifier often produces better results than either one alone. For example, consider the picture of the “energy guide” of a television. It looks almost the same regardless of the brand and model of the television. Hence, it is possible to build a classifier for this class with very few examples. However, the “close-up” views of different television models vary so much that a few-shot learner will easily overfit.
3. Iterative training – When we do not have enough training data to build a high-precision model, we start with shallow linear classifiers, very small neural nets, or heuristic-based classifiers. These baseline models, which often work as low-precision, moderate-recall predictors, are used to generate predictions. Depending on the crowdsourcing budget, a percentage of the high-confidence predictions is sent for manual review. The reviewed images are added to the training data, and the baseline classifiers are retrained. This process is repeated until some base models are good enough, or we have enough data to develop a full-scale model.
4. Multi-stage inference – When it is expensive to procure training data for a complex task, or compute-intensive to run a complex model on the entire set of products, we try to split the problem into two stages. A precursor, namely a simpler image model or a non-image model that learns from context such as product title and category, is added before the main model. A typical example would be the detection of nudity – a problem more likely to occur in certain categories such as wall art or books. Hence, adding a fast, lightweight classifier that separates books and wall art from the rest of the catalog reduces the load on the slower and deeper object detection network that is trained to detect nudity. Also, the training data can be collected from those categories only instead of the entire catalog.
5. Transfer (and meta) learning – Last but not least, appropriate use of transfer learning and meta-learning, when possible, can produce high-quality results with a relatively small amount of data. If the problem at hand involves classes similar to the ones available in public datasets such as ImageNet or COCO (for example, we want to detect rocking chairs, which are similar to dining chairs), or the problem requires finer classification of such a class (for example, we want to distinguish between left-facing and right-facing pictures of shoes), beginning from a pre-trained model and fine-tuning it is a great practical idea.
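To make the multi-stage inference idea (strategy 4) a little more concrete, here is a minimal sketch of a cheap precursor gating an expensive model. It is not our production code: the `Product` fields, the `RISKY_CATEGORIES` set, the `cheap_gate` rule, and the `expensive_detector` callable are all hypothetical names introduced for illustration, and the gate here is a simple category lookup standing in for a lightweight classifier.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical categories where the issue is assumed to be more likely.
RISKY_CATEGORIES = {"wall art", "books"}


@dataclass
class Product:
    product_id: str
    title: str
    category: str
    image_path: str


def cheap_gate(product: Product) -> bool:
    """Stage 1: a fast, non-image check that uses catalog context only."""
    return product.category.strip().lower() in RISKY_CATEGORIES


def multi_stage_flag(
    products: Iterable[Product],
    expensive_detector: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Stage 2 runs only on products that pass the stage-1 gate."""
    flagged = []
    for product in products:
        if not cheap_gate(product):
            continue  # skipped: the expensive model is never called
        score = expensive_detector(product.image_path)
        if score >= threshold:
            flagged.append(product.product_id)
    return flagged


# Stand-in for a slow, deep detector; records which images it was asked to score.
scored_images: List[str] = []

def stub_detector(image_path: str) -> float:
    scored_images.append(image_path)
    return 0.9 if "bad" in image_path else 0.1


catalog = [
    Product("a", "Abstract print", "Wall Art", "bad_1.jpg"),
    Product("b", "Paperback novel", "Books", "ok_2.jpg"),
    Product("c", "Running shoe", "Shoes", "bad_3.jpg"),
]
print(multi_stage_flag(catalog, stub_detector))
print(scored_images)
```

In this toy catalog the detector is called for only two of the three products. Product "c" is never scored even though the stub would have flagged it, which illustrates the trade-off: the precursor saves compute and narrows data collection, but anything it filters out is invisible to the main model.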
Overall, developing classification and object recognition solutions for real problems and scaling them for an enormous catalog is an intriguingly complex problem. To understand more about how we learn from the textbooks and then deviate from them to solve these problems, consider attending my talk at ODSC West in San Francisco, which runs from October 29th to November 1st.