Data engineering and machine learning pipelines are very different, yet they can oddly feel very similar. Many ML engineers I have talked to rely on tools like Airflow to deploy their batch models.
So I wanted to discuss the differences between data engineering and machine learning pipelines.
So let’s dive in.
Over the past 5 years, data accumulation driven by the popularity of social media platforms such as TikTok and Snapchat has highlighted the difference between data and machine learning (ML) pipelines. These two are essential components of every company, but they are often confused. Although they share similarities, their objectives are vastly different.
Building a data and ML platform is not as simple as it sounds. It requires a lot of work, and it is what I do every day. In this post, we will explore the differences between how machine learning and data pipelines work, as well as what is required for each.
Data Engineering Pipelines
First, let’s dive into data pipelines. Data pipelines are a series of processes that extract data from different sources, clean it, and then store it in a data warehouse or database. This is important for companies because it helps them make informed decisions. Data engineering pipelines generally feed into ML pipelines, although communication breakdowns often lead ML engineers to build their own custom end-to-end pipelines instead. A data pipeline is made up of four parts:
Data collection: This is where data is gathered from different sources like databases, APIs, and file systems. The data can be in different formats like structured, semi-structured, or unstructured. Data collection needs careful planning and execution to ensure high-quality data.
Data cleaning: This is where data is cleaned by checking for missing values, correcting errors, and transforming it into a consistent format. This makes sure that the data is accurate and reliable.
Data integration: This is where data from different sources is merged into a single dataset. This is important for businesses because it ensures consistent and accurate information for better decision-making.
Data storage: This is where processed data is stored. It’s important for quick and easy access to data, as well as data protection and security.
These four parts are often referred to as Extract (data collection), Transform (data cleaning and integration), and Load (data storage). The overall process is commonly called ETL, or ELT when the data is loaded before it is transformed.
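To make the ETL flow concrete, here is a minimal sketch in pure Python: extract rows from a CSV source, transform them (drop rows with missing dates, normalize country codes), and load them into SQLite. The table and column names are made up for illustration; a real pipeline would pull from live sources and write to a warehouse.

```python
# A minimal ETL sketch. The table and column names are hypothetical.
import csv
import io
import sqlite3

RAW = """user_id,signup_date,country
1,2023-01-05,US
2,,DE
3,2023-02-11,us
"""

def extract(raw):
    """Extract: parse the raw CSV into a list of dicts."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Transform: drop rows missing a signup date, normalize country codes."""
    return [
        {**row, "country": row["country"].upper()}
        for row in rows
        if row["signup_date"]
    ]

def load(rows, conn):
    """Load: write the cleaned rows into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE users (user_id TEXT, signup_date TEXT, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO users VALUES (:user_id, :signup_date, :country)", rows
    )

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(row_count)  # 2: the row with the missing signup date was dropped
```

Notice the shape: source in, storage out, one direction, which is exactly the linear point-A-to-point-B flow described below.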
Some will note that ELTs and ETLs are a subset of the overarching “data workflows” or data pipelines as there are several other sub-classes that often include streaming patterns such as pub/sub or CDC. — Ben Rogojan
These processes can be performed in various sequences, depending on the requirements of the project.
Regardless of the sequence, data pipelines generally follow a linear process, starting at the source and ending at storage.
That’s why you generally hear data engineers say they move data from point A to point B.
Machine Learning Pipelines
Machine learning (ML) pipelines don’t work in a straight line like data pipelines. Instead, they involve building, training, and deploying ML models, automating the entire process from collecting data to serving a model in production. An ML pipeline is made up of six parts:
Data cleaning: Cleaning and standardizing data is a crucial aspect of machine learning pipelines. While data engineers do an initial pass, data scientists take it further to ensure the data is accurate and useful for modeling. This includes tasks such as ensuring consistent date formats, removing null values, and binning ages. They may also perform exploratory data analysis to identify patterns and trends. The work requires a deep understanding of both the technical aspects of data cleaning and the broader context in which the data is used.
Feature engineering: Feature engineering is the process of selecting and extracting relevant features from raw data to improve the performance of machine learning models. To increase iteration speed, machine learning engineers will often build out a feature store: essentially a set of versioned tables created for each model focus area, such as sales or acquisition.
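As a rough illustration of what feature engineering produces, here is a sketch that turns raw user events into one feature row per user: a spend aggregate plus a binned age. The feature names (total_spend, order_count, age_bin) are hypothetical, not from any particular feature store.

```python
# A feature-engineering sketch: raw events in, one feature row per user out.
from collections import defaultdict

raw_events = [
    {"user_id": 1, "age": 23, "amount": 12.0},
    {"user_id": 1, "age": 23, "amount": 30.0},
    {"user_id": 2, "age": 47, "amount": 5.0},
]

def age_bin(age):
    """Bin raw ages into coarse buckets the model can generalize over."""
    if age < 30:
        return "18-29"
    if age < 50:
        return "30-49"
    return "50+"

def build_features(events):
    totals, counts, ages = defaultdict(float), defaultdict(int), {}
    for e in events:
        totals[e["user_id"]] += e["amount"]
        counts[e["user_id"]] += 1
        ages[e["user_id"]] = e["age"]
    # One row per user: the kind of versioned table a feature store holds.
    return {
        uid: {
            "total_spend": totals[uid],
            "order_count": counts[uid],
            "age_bin": age_bin(ages[uid]),
        }
        for uid in totals
    }

features = build_features(raw_events)
print(features[1])  # {'total_spend': 42.0, 'order_count': 2, 'age_bin': '18-29'}
```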
Model training: Model training typically begins with selecting one or more appropriate machine learning algorithms. Once an algorithm has been selected, we train it on the feature-engineered data. This usually involves iterating on the model architecture and hyperparameters until the desired level of performance is achieved. Finally, the trained model can be used to make predictions on new data.
Model evaluation: This phase involves assessing the performance of the trained ML model on a test dataset. When the model has been evaluated and is ready for deployment, we will often save it to a model registry.
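A toy version of the train-then-evaluate loop: gradient descent on a one-feature linear model in plain Python. A real pipeline would use a library such as scikit-learn; the data here is synthetic, with a known true relationship of y = 2x + 1.

```python
# Train on the first rows, evaluate on a held-out test set.
data = [(x, 2.0 * x + 1.0) for x in range(20)]  # synthetic: y = 2x + 1
train, test = data[:15], data[15:]

w, b, lr = 0.0, 0.0, 0.01

# Model training: iterate on the parameters until the fit converges.
for _ in range(2000):
    grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / len(train)
    grad_b = sum(2 * (w * x + b - y) for x, y in train) / len(train)
    w -= lr * grad_w
    b -= lr * grad_b

# Model evaluation: mean squared error on the held-out test set.
mse = sum((w * x + b - y) ** 2 for x, y in test) / len(test)
print(round(w, 2), round(b, 2))  # close to the true values 2.0 and 1.0
```

In a real pipeline, the "iterate until performance is acceptable" step also covers trying different algorithms and hyperparameters, not just more gradient steps.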
Model registry: A model registry is a vital part of a company’s machine learning setup. It acts as a central hub to store details about the models that have been trained and deployed. It tracks not only the models themselves but also performance metrics for each model and other metadata. This information is critical to ensure models can be reproduced and tracked throughout their lifecycle.
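A toy model registry can be sketched as a single SQLite table tracking each model's name, version, metrics, and artifact location. Real registries (MLflow, for example) are far richer; the schema and the s3:// paths below are purely illustrative.

```python
# A toy model registry backed by SQLite. Schema and paths are made up.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE model_registry ("
    " name TEXT, version INTEGER, metrics TEXT, artifact_path TEXT,"
    " PRIMARY KEY (name, version))"
)

def register(name, version, metrics, path):
    """Record a trained model and its evaluation metrics."""
    conn.execute(
        "INSERT INTO model_registry VALUES (?, ?, ?, ?)",
        (name, version, json.dumps(metrics), path),
    )

def latest(name):
    """Fetch the newest registered version of a model for deployment."""
    row = conn.execute(
        "SELECT version, metrics, artifact_path FROM model_registry"
        " WHERE name = ? ORDER BY version DESC LIMIT 1",
        (name,),
    ).fetchone()
    return {"version": row[0], "metrics": json.loads(row[1]), "path": row[2]}

register("churn", 1, {"accuracy": 0.81}, "s3://models/churn/v1")
register("churn", 2, {"accuracy": 0.84}, "s3://models/churn/v2")
print(latest("churn")["version"])  # 2
```

The deployment step below then only needs to ask the registry for the latest (or a pinned) version rather than caring where the artifact physically lives.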
Model deployment: We take the saved model (whether from a model registry or wherever else it is stored) and integrate it into a production system, where it can be used to make predictions on new data.
Model monitoring: The primary monitoring techniques are performance analysis monitoring and data drift monitoring.
Performance analysis monitoring involves setting up rules to identify whether a metric has degraded within a certain time period, such as an hour, a month, or some other cadence. For example, accuracy above 80% can be a good benchmark to aim for, but it’s also important to monitor other metrics, such as RMSE or MAE if working with regression models, or precision and recall if working with classification models. In addition to these checks, it’s helpful to keep track of any anomalies or patterns in the data that may impact the model’s performance over time.
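Such a rule can be as simple as flagging any tracked metric that falls below its minimum threshold in a monitoring window. The metric names and thresholds below are made up for the example.

```python
# A sketch of a performance-analysis rule: return breached metrics.
def check_metrics(window_metrics, thresholds):
    """Return the names of metrics that fell below their threshold."""
    return [
        name
        for name, value in window_metrics.items()
        if value < thresholds.get(name, float("-inf"))
    ]

thresholds = {"accuracy": 0.80, "recall": 0.70}
hourly = {"accuracy": 0.83, "recall": 0.65}  # metrics from the last hour

alerts = check_metrics(hourly, thresholds)
print(alerts)  # ['recall']: accuracy passed, recall breached its floor
```

In production this check would run on a schedule and page someone (or trigger retraining) when the returned list is non-empty.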
The other primary way to monitor a model’s performance is to examine the underlying data. To do this, we need to compare the distribution of the data using statistical techniques and other methods.
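As a simple illustration of a drift check, the sketch below compares a feature's mean between the training window and a recent serving window, measured in baseline standard deviations. Real systems often use tests such as PSI or Kolmogorov-Smirnov; the tolerance and sample data here are invented.

```python
# A minimal data-drift check: has the feature's mean shifted too far?
import statistics

def drifted(baseline, current, tolerance=0.5):
    """Flag drift when the current mean moves more than `tolerance`
    baseline standard deviations away from the baseline mean."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    shift = abs(statistics.mean(current) - base_mean)
    return shift > tolerance * base_std

training_ages = [22, 25, 31, 28, 35, 40, 27, 30]
serving_ages = [48, 52, 45, 50, 47, 55, 49, 51]

print(drifted(training_ages, serving_ages))  # True: the audience got older
```

When a check like this fires, it usually means the model is scoring a population it was never trained on, which is the cue to loop back to the start of the pipeline.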
All of these components form a cycle: we start with data, end by reintroducing new data, and then go through the cycle again.
Job Types
Batch jobs process data or generate predictions at regular intervals and store the results. For data pipelines, this might mean pulling data every 30 minutes or every hour. Machine learning pipelines often run batch jobs to make offline predictions, for example, predicting user preferences while users are not currently watching shows.
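A batch job boils down to "process everything that accumulated in this run's window, then store the result." The sketch below shows that shape; in practice a scheduler such as Airflow or cron would invoke it on a cadence, and the event data here is invented.

```python
# A batch-job sketch: process only the records inside this run's window.
import datetime as dt

events = [
    {"ts": dt.datetime(2024, 1, 1, 9, 10), "amount": 5.0},
    {"ts": dt.datetime(2024, 1, 1, 9, 40), "amount": 7.5},
    {"ts": dt.datetime(2024, 1, 1, 10, 5), "amount": 2.5},
]

def run_batch(events, window_start, window_end):
    """One scheduled run: aggregate the records that landed in the window."""
    batch = [e for e in events if window_start <= e["ts"] < window_end]
    return {"rows": len(batch), "total": sum(e["amount"] for e in batch)}

result = run_batch(
    events,
    dt.datetime(2024, 1, 1, 9, 0),
    dt.datetime(2024, 1, 1, 10, 0),
)
print(result)  # {'rows': 2, 'total': 12.5}
```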
Data pipelines and ML pipelines have another job type in common: on-demand pipelines.
In data pipelines, this often looks like the “push mechanism” where an event is created and the data from that event is pushed to a Kafka cluster.
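A very loose stand-in for this push mechanism, using an in-memory topic rather than an actual Kafka cluster (no Kafka client is involved; the class below only mimics the produce/consume shape):

```python
# An in-memory topic mimicking the push-then-consume pattern.
from collections import deque

class Topic:
    def __init__(self):
        self._events = deque()

    def push(self, event):
        """Producer side: an event occurs and is pushed immediately."""
        self._events.append(event)

    def poll(self):
        """Consumer side: pull the next event for downstream processing."""
        return self._events.popleft() if self._events else None

clicks = Topic()
clicks.push({"user_id": 7, "action": "page_view"})
print(clicks.poll()["action"])  # page_view
```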
In machine learning pipelines, this type of process is often referred to as online inference. For example, a user clicking on a website page may require a new prediction to be made.
Another example is when you finish watching HBO Max’s The Last of Us and need a recommendation for a new show, whether along similar lines or something different.
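An online-inference handler can be sketched as a model loaded once at start-up and a function that scores each incoming request the moment it arrives. The weights and feature names below are entirely made up.

```python
# Online inference sketch: load weights once, score each request on demand.
# "Model": a toy linear scorer with hypothetical weights learned offline.
MODEL = {"bias": 0.1, "weights": {"minutes_watched": 0.002, "is_finale": 0.5}}

def predict(features):
    """Score a single request the moment it arrives (online inference)."""
    score = MODEL["bias"] + sum(
        MODEL["weights"].get(name, 0.0) * value
        for name, value in features.items()
    )
    return min(max(score, 0.0), 1.0)  # clamp to a probability-like range

# Triggered by a single user event, e.g. finishing an episode:
score = predict({"minutes_watched": 300, "is_finale": 1})
print(score)  # 0.1 + 0.6 + 0.5 = 1.2, clamped to 1.0
```

The key contrast with a batch job is the trigger: nothing runs on a schedule; each prediction happens because one user did one thing.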
There is one job type I have only seen in data pipelines: streaming. A streaming data pipeline is a system that processes and moves data in real time from one place to another. While some may argue that this is similar to on-demand pipelining, I disagree: streaming data flows continuously, while on-demand requests happen sporadically.
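The continuous nature of streaming can be sketched with a generator chain: each record is transformed as it arrives, rather than waiting for a window to close. The sensor readings and the Fahrenheit conversion are invented for the example.

```python
# A streaming sketch: transform each record as it flows through.
def source():
    """Stand-in for a continuous event source (e.g. a sensor feed)."""
    for reading in (20, 25, 30, 35, 40):
        yield {"sensor": "temp", "reading": reading}

def transform(events):
    """Enrich each event the moment it arrives."""
    for e in events:
        yield {**e, "fahrenheit": e["reading"] * 9 / 5 + 32}

# The sink consumes records one at a time, never a whole batch.
sink = [e["fahrenheit"] for e in transform(source())]
print(sink)  # [68.0, 77.0, 86.0, 95.0, 104.0]
```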
Data pipelines have three types of jobs: streaming, batch, and on-demand. In contrast, ML pipelines only have batch and on-demand (or “inference”) job types.
Computational Requirements
Data and ML pipelines are critical components of every organization and can have similar computational requirements.
Data pipelines depend on the volume of incoming data, which ideally changes infrequently. However, sudden changes in data volume can happen, and it is important to ensure that the data pipelines can adjust accordingly. Otherwise, there will be a delay in data processing, which can negatively impact an organization’s operations.
On the other hand, ML pipelines depend on both the amount of data and the stage of the ML pipeline process. The computational needs of an ML pipeline change based on the stage of the pipeline, because training requires more compute power than processing or predicting. Therefore, we need to adjust the computational needs accordingly.
Regardless of whether we are running data or ML pipelines, it is critical that they can handle sudden and significant changes in compute requirements. This is especially true for data pipelines, where an unexpected spike in data volume can cause a backlog that affects the organization’s ability to make informed decisions. If our pipelines can’t scale, they will fail and cause delays in data processing, resulting in a poor user experience.
Never Ending Pipelines
In conclusion, both data pipelines and ML pipelines are important for companies that deal with data. They have different goals, job types, and computational needs. Understanding these differences is important for building, scaling, and maintaining effective pipelines that can handle sudden changes in compute requirements and ensure accurate, timely, and actionable data for better decision-making and improved performance.