Upholding Data Quality in Machine Learning Systems

In the dazzling world of machine learning (ML), it’s easy to get caught up in the thrill of devising sophisticated algorithms, captivating visualizations, and impressive predictive models.

Yet, much like the durability of a building depends not just on its visible structure but also on its hidden foundations, the effectiveness of machine learning systems hinges on an often-overlooked but crucial aspect: data quality.

The Imperative of Upstream Data Quality Assurance

Think of your ML training and inference pipelines as the journey of a steam train.

It’s critical to maintain the health of the train itself — the ML system — but what if the tracks are compromised?

If the quality of data feeding your system is not ensured upstream, it’s akin to a damaged rail track — your train is destined to derail, sooner or later, especially when operating at scale.

Therefore, it’s paramount to monitor data quality from the get-go, right at the source.

Like a train inspector examining the tracks ahead of a journey, we must scrutinize our data at its point of origin.

This can be achieved through a concept known as ‘Data Contracts’.

The Role of Data Contracts in Upholding Data Quality

Imagine being invited to a potluck dinner, where each guest brings a dish.

Without any coordination, you could end up with a feast entirely composed of desserts!

Similarly, in the vast landscape of data, there must be an agreement (i.e., the Data Contract) between data producers and consumers to ensure the produced data meets specific quality standards.

This contract is essentially a blueprint, encompassing a non-exhaustive list of metadata, such as:

  1. Schema Definition: Details of the data structure, like fields, data types, etc.
  2. Schema Version: Ensures consistency in light of alterations or improvements.
  3. Service Level Agreement (SLA) metadata: SLA specifications to manage expectations.
  4. Semantics: Clarifies meaning and interpretation of data.
  5. Lineage: Chronicles the data’s journey, from origin to destination.
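
To make the list above concrete, here is a minimal sketch of how such a contract could be written down in Python. The class name `DataContract`, its field names, and the `orders` example values are all assumptions made for illustration; they are not taken from any particular contract registry or standard.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract record; the field names are illustrative only."""
    name: str
    schema_version: str        # version of the schema below
    schema: dict[str, type]    # field name -> expected Python type
    sla: dict[str, float]      # SLA metadata, e.g. delay and completeness targets
    semantics: dict[str, str]  # field name -> human-readable meaning
    lineage: list[str] = field(default_factory=list)  # path from origin to destination


# Made-up example contract for an "orders" dataset.
orders_contract = DataContract(
    name="orders",
    schema_version="1.2.0",
    schema={"order_id": str, "amount": float, "created_at": str},
    sla={"max_delay_minutes": 15.0, "min_completeness": 0.99},
    semantics={"amount": "Order total in USD, tax included"},
    lineage=["orders-service", "orders.raw", "orders.validated"],
)
```

In practice a contract like this would more likely live in a schema format under version control than in application code; the dataclass only shows which pieces of metadata travel together.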

Let’s understand this better through an architecture that enforces Data Contracts.

Data Contracts in Action: An Example Architecture

Picture a manufacturing assembly line, where every worker knows their role and the standard they need to meet.

Now, let’s apply this concept to our data architecture.

  1. Schema changes are first carried out in version control and once approved, they are implemented in data-producing applications, databases, and a central Data Contract Registry. This is where your data contract enforcement ideally begins — at the stage of data production itself. Any validation steps further downstream act as safeguards to prevent low-quality data from infiltrating the system.
  2. The data, once produced, is pushed to a messaging system, such as Kafka topics. This could include events directly emitted by application services or raw data topics for Change Data Capture (CDC) streams.
  3. Now, think of Flink applications as vigilant gatekeepers, consuming data from raw data streams and validating it against schemas in the Contract Registry.
  4. Data not meeting the contract — akin to rejects on an assembly line — is directed to the Dead Letter Topic.
  5. Validated data is approved for the Validated Data Topic, much like quality-approved goods ready for packaging and shipping.
  6. The validated data is then sent to object storage for another round of validation, acting as a double-check mechanism.
  7. On a schedule, the data in the Object Storage undergoes validation against additional SLAs in Data Contracts. After passing this scrutiny, it’s pushed to the Data Warehouse, where it’s transformed and modeled for analytical purposes.
  8. From here, data reaches the Feature Store System by two paths: the modeled and curated data from the Data Warehouse is sent there for further feature engineering, while real-time features are ingested directly from the Validated Data Topic. Note that ensuring data quality at this stage can be challenging because checks against SLAs are difficult to conduct.
  9. This high-quality data is then utilized in Machine Learning Training Pipelines.
  10. The very same data is used for feature serving in inference.
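
The gatekeeping in steps 3 to 5 and the scheduled SLA check in step 7 can be sketched in plain Python. This is not Flink or Kafka code: the lists stand in for topics, and `ORDER_SCHEMA`, `violations`, `route`, and `meets_freshness_sla` are hypothetical names chosen for this example. Freshness is also just one possible SLA, assumed here for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schema, as it might be fetched from a Contract Registry.
ORDER_SCHEMA = {"order_id": str, "amount": float, "created_at": str}


def violations(record: dict, schema: dict) -> list[str]:
    """Return the contract violations for one record (empty list if valid)."""
    problems = []
    for name, expected in schema.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    return problems


def route(records: list[dict], schema: dict) -> tuple[list[dict], list[dict]]:
    """Split records into (validated, dead_letter), mirroring steps 3 to 5."""
    validated, dead_letter = [], []
    for record in records:
        errors = violations(record, schema)
        if errors:
            # Keep the reasons next to the rejected record for later debugging.
            dead_letter.append({"record": record, "errors": errors})
        else:
            validated.append(record)
    return validated, dead_letter


def meets_freshness_sla(records: list[dict], max_delay: timedelta, now: datetime) -> bool:
    """Batch-level check for step 7: the newest record must be recent enough."""
    if not records:
        return False
    newest = max(datetime.fromisoformat(r["created_at"]) for r in records)
    return now - newest <= max_delay


raw = [
    {"order_id": "a1", "amount": 19.99, "created_at": "2024-01-01T10:00:00+00:00"},
    {"order_id": "a2", "amount": "oops", "created_at": "2024-01-01T10:05:00+00:00"},
    {"order_id": "a3", "created_at": "2024-01-01T10:06:00+00:00"},
]
validated, dead_letter = route(raw, ORDER_SCHEMA)
now = datetime(2024, 1, 1, 10, 10, tzinfo=timezone.utc)
fresh = meets_freshness_sla(validated, timedelta(minutes=15), now)
```

Only the first record passes the gate; the other two land in the dead-letter list with the reason attached. Record-level checks suit a stream, while a check like freshness only makes sense over a batch, which is one reason it is deferred to the scheduled step.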

Remember, ML Systems are also susceptible to data-related issues like Data Drift and Concept Drift.

While these are considered ‘silent failures’ and can be monitored, they aren’t typically included in the Data Contract.
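
As a rough illustration of what monitoring for data drift can look like, the sketch below computes a Population Stability Index (PSI) between a reference sample and a newer one. PSI is only one of several possible drift measures and is an assumption of this example, not something the architecture above prescribes; the function name `psi` and the sample values are made up.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a new one."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range go into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Made-up feature values: training data vs. a shifted serving sample.
training = [i / 100 for i in range(100)]
serving = [0.5 + i / 200 for i in range(100)]

no_drift = psi(training, training)
drift = psi(training, serving)
```

Identical samples give a PSI of zero, and the value grows as the two distributions move apart. A check like this runs alongside the contract rather than inside it, since it compares data against earlier data, not against an agreed specification.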

We’ll dive deeper into the topic of data drift in a later article.

Concluding Remarks

The hidden strength of machine learning systems lies in the unseen integrity of the data fuelling them.

Data quality, albeit unglamorous, plays a pivotal role in the success of ML projects.

The concept of Data Contracts ensures that this vital aspect isn’t overlooked.

Remember, it’s not just about building the fastest train or the most impressive station; it’s equally about maintaining the quality of the tracks.

No matter how sophisticated your machine learning system may be, without high-quality data, its journey will be fraught with disruptions and potential derailments.

Keep this in mind and ensure data quality is given due importance in your machine learning endeavors.

After all, the most thrilling ML advancements are built not just on revolutionary algorithms, but also on the back of reliable, high-quality data.

Article originally posted here. Reposted with permission.

ODSC Community
