Editor’s note: Ray Reed is a speaker for ODSC APAC 2022 this September 7th-8th. Be sure to check out his talk, “Monitoring CV Systems: A Unique Solution to a Unique Problem,” there to learn more about monitoring image data.
As machine learning ecosystems become increasingly complex and data volumes grow at a dizzying rate, maintaining observability of your data and model health is more critical than ever. Data drift, concept drift, and data quality degradation can cause models to fail. In the worst cases, these are silent failures and can go undetected for months.
When working with tabular datasets, data quality can be monitored by capturing telemetry around missing values, cardinality, and data types for each of a dataset’s features. Data drift can be monitored by capturing descriptive statistics and approximate distributions of your data and computing statistical distances such as the Hellinger distance or KL divergence. However, it’s not obvious how a similar approach can be applied to image data. As we will see, monitoring unstructured data such as images can be achieved by capturing telemetry that is itself structured and therefore compatible with common statistical approaches.
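To make the drift computation concrete, here is a minimal sketch of the two distances named above, applied to two histograms of a single feature. The bin counts and function names are made up for illustration, and it assumes both histograms share the same bin edges.

```python
import math

def normalize(counts):
    """Convert raw histogram counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram must contain at least one observation")
    return [c / total for c in counts]

def hellinger_distance(p, q):
    """Hellinger distance between discrete distributions (0 = identical, 1 = no overlap)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) in nats; eps guards against empty bins in q."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)

# Hypothetical histograms of one feature, binned over the same edges.
baseline_counts = [5, 20, 50, 20, 5]    # e.g. the training data
current_counts = [2, 10, 30, 38, 20]    # e.g. last week's production data

p = normalize(baseline_counts)
q = normalize(current_counts)
print(f"Hellinger: {hellinger_distance(p, q):.3f}")
print(f"KL(p || q): {kl_divergence(p, q):.3f}")
```

The Hellinger distance is symmetric and bounded between 0 and 1, which makes it convenient for setting alert thresholds. KL divergence is asymmetric and unbounded, so the order of the arguments matters.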
Before moving forward, we must consider the types of problems that threaten computer vision systems. Once we’ve established this much, we can design a solution around these challenges.
There are a variety of physical factors which can impact the consistency and quality of image data such as…
- Device used
- Device settings
- Changes in environment
- Changes in the object(s) being detected
Hardware is an obvious example. Different devices (or device settings) can impact the size and resolution of images, along with many other properties. Suppose a healthcare company upgrades to a medical imaging device with a higher resolution. While these images may make diagnoses easier for human doctors, computer vision systems trained on images generated by the original device may actually perform worse.
Physical changes in the environment pose another challenge. For example, consider a computer vision system responsible for performing inspections of products on an assembly line. If the factory begins using a different type of light bulb, this could impact the images in a way that the machine learning model was not prepared to handle. While this is not likely to throw a hard error, the model’s performance can plummet, unbeknownst to those responsible for the model.
Aside from hardware and lighting, countless other challenges can arise directly in the scene being imaged. There can be changes in the image background, changes in the size or number of target objects (in object detection), or target objects that differ significantly from anything encountered in the training set.
Images often have inconsistent lighting, backgrounds, image quality, object sizes, object counts, etc.
Data Pipeline Factors
Even if we can ensure that the raw images are consistently representative of our training data, these images often travel through a complex pipeline, which introduces many possible points of failure such as…
- Swapped color channels
- Inconsistent color spaces
- Inconsistent scaling
Different image processing frameworks may read color channels in a different order, causing a channel swap when migrating to a new tool. A newly introduced bug may cause inconsistencies in whether grayscale or color images are passed to our model, or in the way that pixel values are scaled. Suppose we train a model that expects pixel values ranging from 0-255. If the pipeline begins scaling these values to the range 0.0-1.0, every pixel falls at the very bottom of the range the model expects, so the model is effectively fed black, featureless images.
Data pipeline issues may cause inconsistencies in pixel value scaling, channel ordering, or the number of channels.
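The pipeline failures described above tend to leave simple numeric fingerprints. The sketch below uses only the Python standard library and represents an image as a nested list (height x width x channels); a real pipeline would more likely use an array library. The function names and thresholds are illustrative assumptions, not part of any particular tool.

```python
def channel_stats(image):
    """Return (channel_count, per_channel_means, min_value, max_value).

    `image` is a nested list shaped height x width x channels.
    """
    pixels = [px for row in image for px in row]
    n_channels = len(pixels[0])
    means = [sum(px[c] for px in pixels) / len(pixels) for c in range(n_channels)]
    values = [v for px in pixels for v in px]
    return n_channels, means, min(values), max(values)

def pipeline_warnings(image, expected_channels=3, expected_max=255):
    """Heuristic checks only; a genuinely all-black image would also trip the scaling check."""
    n_channels, _, _, hi = channel_stats(image)
    warnings = []
    if n_channels != expected_channels:
        warnings.append(f"expected {expected_channels} channels, got {n_channels}")
    if expected_max > 1 and hi <= 1.0:
        warnings.append("pixel values look rescaled to 0.0-1.0")
    return warnings

# A tiny, made-up 2x2 reddish RGB image with values in 0-255.
rgb = [[(200, 40, 10), (220, 60, 20)],
       [(180, 30, 5), (210, 50, 15)]]
bgr = [[px[::-1] for px in row] for row in rgb]                       # simulated channel swap
scaled = [[tuple(v / 255 for v in px) for px in row] for row in rgb]  # simulated rescaling

print(channel_stats(rgb)[1])      # per-channel means in R, G, B order
print(channel_stats(bgr)[1])      # the same means in reversed order
print(pipeline_warnings(scaled))  # flags the 0.0-1.0 rescaling
```

A single swapped image is hard to spot this way, but tracked over many images, a sudden exchange of the first and last channel means is a strong hint that channel order has changed.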
To monitor the issues described above effectively, we need to compute metrics that are sensitive to these events. For example, the mean pixel value of an image serves as a measure of image brightness and can be used to monitor for changes in lighting. Most image processing tools can compute hue and saturation, which provide information about the color palette of an image. These quantities can be used to monitor for changes in image backgrounds or for issues such as swapped color channels. Image height and width can be monitored trivially by capturing the shape of the tensor representing the raw image data. The number of channels in this tensor can be used to infer the color space of an image (RGB, CMYK, grayscale).
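As a rough illustration of these metrics, the following sketch computes dimensions, brightness, mean saturation, and mean hue for an RGB image. It uses `colorsys` from the Python standard library and a nested-list image purely to stay dependency-free; the `image_metrics` function and its output keys are hypothetical names chosen for this example.

```python
import colorsys
import math

def image_metrics(image):
    """Summarize an RGB image given as a height x width x 3 nested list (values 0-255)."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    pixels = [px for row in image for px in row]

    # Mean of all pixel values as a rough measure of brightness.
    brightness = sum(sum(px) for px in pixels) / (len(pixels) * channels)

    # colorsys works on floats in [0, 1] and returns (hue, saturation, value) in [0, 1].
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    saturation = sum(s for _, s, _ in hsv) / len(hsv)

    # Hue is an angle, so average it on the circle rather than arithmetically.
    x = sum(math.cos(2 * math.pi * h) for h, _, _ in hsv)
    y = sum(math.sin(2 * math.pi * h) for h, _, _ in hsv)
    hue = (math.atan2(y, x) / (2 * math.pi)) % 1.0

    return {"height": height, "width": width, "channels": channels,
            "brightness": brightness, "mean_saturation": saturation, "mean_hue": hue}

# A made-up 2x2 pure-red image, and the same image with R and B swapped.
red = [[(255, 0, 0), (255, 0, 0)],
       [(255, 0, 0), (255, 0, 0)]]
swapped = [[px[::-1] for px in row] for row in red]

print(image_metrics(red))
print(image_metrics(swapped))
```

Note that brightness and saturation are identical for the two images, while the mean hue moves from red to blue. This is why hue is a useful signal for channel swaps that a brightness metric alone would miss.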
In many cases, valuable metadata is available in the image file in the form of Exif data. Exif data often includes information such as the device make and model, the camera settings used when taking the photo, and the geolocation, date, and time associated with the image. Consider a model trained to identify plant species from images. Capturing geolocation from image Exif data can help ML engineers judge whether new images are likely to contain plant species that weren’t included in the training dataset.
Example of camera settings extracted from Exif data
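For the plant species example, one simple check is whether an image's coordinates fall inside the region covered by the training data. Exif stores latitude and longitude as degrees, minutes, and seconds plus a hemisphere reference, so they need converting to decimal degrees first. The sketch below assumes the GPS tags have already been read from the file into a dictionary; the tag values and the bounding box are invented for illustration.

```python
def dms_to_decimal(dms, ref):
    """Convert Exif-style (degrees, minutes, seconds) plus a hemisphere ref to decimal degrees."""
    degrees, minutes, seconds = (float(x) for x in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal notation.
    return -decimal if ref in ("S", "W") else decimal

def in_training_region(lat, lon, region):
    """True if the point lies inside a (min_lat, max_lat, min_lon, max_lon) box.

    A plain box like this does not handle regions that cross the antimeridian.
    """
    min_lat, max_lat, min_lon, max_lon = region
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

# Hypothetical GPS tags, as they might look after being read from an image file.
gps = {
    "GPSLatitude": (47, 36, 22.0), "GPSLatitudeRef": "N",
    "GPSLongitude": (122, 19, 55.0), "GPSLongitudeRef": "W",
}
# Hypothetical bounding box for the area the training images came from.
TRAINING_REGION = (45.5, 49.0, -125.0, -116.9)

lat = dms_to_decimal(gps["GPSLatitude"], gps["GPSLatitudeRef"])
lon = dms_to_decimal(gps["GPSLongitude"], gps["GPSLongitudeRef"])
print(round(lat, 4), round(lon, 4), in_training_region(lat, lon, TRAINING_REGION))
```

In a monitoring setting, the fraction of images falling outside the training region could be tracked over time like any other numeric feature.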
Monitoring at Scale
Now that we know what kind of telemetry we wish to capture for images, how can we turn this into a full-fledged monitoring solution? The team at WhyLabs has developed an open source data logging library, whylogs, which captures valuable telemetry in an efficient and customizable way for any dataset. Customizability is a priority in its design, so whylogs can be integrated with any data pipeline, whether you’re working with image data, tabular data, text, or something else.
Furthermore, users can leverage powerful anomaly detection, informative visualizations, and automated notifications by uploading profiles to the WhyLabs AI Observatory for an end-to-end monitoring solution. Your first model is free!
To learn more about monitoring computer vision systems at scale, be sure to attend my talk at ODSC APAC 2022, entitled “Monitoring CV Systems: A Unique Solution to a Unique Problem.”
About the Author/ODSC APAC 2022 Speaker on monitoring image data:
Ray Reed is a Customer Success Data Scientist at WhyLabs, the AI Observability company. He has a long-held passion for machine learning and loves helping customers save time and money by monitoring their ML systems at scale. Ray was formerly a Senior Success Engineer at Datorama, a Salesforce Company, where he drove success for large enterprise customers with a focus on improving query performance across the company. In his spare time, Ray enjoys hiking, music, and more hiking.