In 2021 we watched Fivetran raise $150 million, Matillion raise $16 million, and Informatica go public.
All of these companies have some piece of their business connected to data pipelines, which are also sometimes referred to as ETL, ELT, E(t)LT, and CDC.
For today, when I say data pipeline I am focused on batch processing and what you need to consider when building batch data pipelines, regardless of the tools you are using.
What Should You Consider When Building Pipelines?
Tools and technology are just that.
They won’t actually drive any form of impact on their own.
They won’t develop processes that are connected to dashboards that in turn drive actions without people. Nor are the numbers they are creating going to magically jump off the screen and fix a business.
So before building any data pipeline it’s important to consider a few things.
🤔 What Is This Data Being Used For?
Understanding business context is important for engineers. Let me say that again.
Understanding business context is important for engineers.
As a data engineer or ETL developer, knowing what initiative you’re building for can be a motivator as well as help you make better design decisions. So an important step in developing a data pipeline is understanding what its purpose is and what it is driving.
Is it going to be used for a fraud detection system, some standard KPIs, a new model meant to increase sales, etc? These are direct outputs you can connect to your work. Personally, it makes me feel much better when I can concretely connect my data to actual data products, dashboards, models, and research. It means I played an important role.
But that means you need to ask…what is this data for?
🎯 Is All The Data Valid?
This is probably one of the most important things to consider before building a data pipeline. Is the data even valid?
Or, are the specific fields you are pulling even valid? It’s not uncommon for tables to have fields that are either unsupported or just plain wrong even though the bulk of the table is right. In these cases, it is important not to bring in the wrong data.
Because it will be used by someone who doesn’t know it’s wrong.
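As a rough illustration, a pipeline can run lightweight validity checks before loading anything. The field names and rules below (order_id, amount, order_date) are hypothetical stand-ins for whatever your source table actually contains:

```python
from datetime import date

# Hypothetical validity rules for one source table. Rows that fail
# any rule are held back instead of being loaded into the warehouse.
def validate_row(row: dict) -> list:
    """Return a list of validation errors for one extracted row."""
    errors = []
    if row.get("order_id") is None:
        errors.append("order_id is missing")
    amount = row.get("amount")
    if amount is None or amount < 0:
        errors.append("amount must be a non-negative number")
    order_date = row.get("order_date")
    if order_date is not None and order_date > date.today():
        errors.append("order_date is in the future")
    return errors

rows = [
    {"order_id": 1, "amount": 19.99, "order_date": date(2021, 5, 1)},
    {"order_id": None, "amount": -5.0, "order_date": date(2021, 5, 2)},
]
# Only load rows with no validation errors; log or quarantine the rest.
clean_rows = [r for r in rows if not validate_row(r)]
```

The useful habit is less the specific checks and more that bad rows get quarantined loudly, rather than silently landing in a table someone trusts.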
⏰ How Often Will You Pull Data?
Data that isn’t streaming or real-time often can be pulled at some regular interval. This depends on:
- When is the data needed?
- How much data is being pulled?
- How often does the data change?
It’s easy and very common to create pipelines that run at midnight every day. This is probably how 50% of pipelines are set up (a made-up statistic). Whether it’s cron or some other scheduler, this method makes long-term maintenance easy to manage.
However, there are a lot of reasons to implement more frequent data pulls. For example, if you need a report at a certain time of day every day, then it might make sense to do hourly pulls or even have real-time data.
On the other hand, if you only need a report once a week, then daily is likely fine.
Of course, if a data pipeline takes 20 hours to run, then you might also need to consider running the pipeline more frequently in smaller batches (or scaling up your compute). Truthfully, there are a few different ways you could improve performance before increasing frequency.
The point here is that you will need to ask your business stakeholders how often they will be looking at this data.
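One way to frame that stakeholder conversation is to translate their freshness requirement into a batch cadence. This is just a sketch with made-up thresholds, not a rule:

```python
# Map a stakeholder's freshness requirement (in hours) to a batch
# cadence plus an example cron expression. Thresholds are illustrative;
# the right cutoffs depend entirely on your business and data volume.
def pick_cadence(freshness_hours: float) -> tuple:
    if freshness_hours <= 1:
        return ("hourly", "0 * * * *")   # top of every hour
    if freshness_hours <= 24:
        return ("daily", "0 0 * * *")    # midnight every day
    return ("weekly", "0 0 * * 1")       # midnight every Monday

# A stakeholder who only needs data refreshed once a day gets the
# classic midnight run.
cadence, cron_expr = pick_cadence(24)
```

The value of writing this down, even informally, is that the cadence becomes a documented decision tied to a requirement rather than a default everyone inherits.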
📈 Incremental Loads, Full Extracts, And Historical Updates
Data is generally extracted in a few different ways.
You can do 100% table pulls, which means you don’t need to figure out what data changed. This is by far the easiest option but can become expensive or time-consuming.
You can do incremental loads if data is purely appended, or you can do historical merges, where you only pull newly inserted or updated data and then merge it with the old data.
Each of these methods has a varying level of difficulty, with a full table pull being the easiest and a historical merge being the hardest. Especially if you’re using Redshift in 2014.
In order to know which of these methods you should do, you will need to understand how your data is being populated inside your source system and how much data there is.
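A common way to implement incremental loads is a high-watermark pattern: remember the newest `updated_at` value you have already loaded, and on the next run only pull rows past it. The column name and in-memory rows below are stand-ins for a real source query:

```python
from datetime import datetime

# High-watermark incremental extract sketch. In a real pipeline,
# source_rows would come from a query like
# "WHERE updated_at > :watermark" and the watermark would be persisted
# between runs (in a state table, not in memory).
def extract_incremental(source_rows, watermark):
    """Return rows newer than the watermark, plus the new watermark."""
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    new_watermark = max(
        (r["updated_at"] for r in new_rows), default=watermark
    )
    return new_rows, new_watermark

source = [
    {"id": 1, "updated_at": datetime(2021, 6, 1)},
    {"id": 2, "updated_at": datetime(2021, 6, 3)},
]
# First run: the watermark predates all rows, so everything is pulled.
rows, wm = extract_incremental(source, datetime(2021, 1, 1))
# Second run: nothing changed upstream, so nothing new is pulled.
rows2, wm2 = extract_incremental(source, wm)
```

Note this sketch only handles inserts and updates; deletes and late-arriving updates are exactly where historical merges get hard.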
👨💼👩💼 Who Will Manage The Pipeline?
An important question to ask when making data pipelines is who is going to manage the pipeline long-term.
Pipelines don’t run smoothly 100% of the time. They can often fail or have bad data come through. There needs to be a clear understanding of who owns the pipelines so that in the case of failures, someone is there to fix said pipelines. Otherwise, you end up with a lot of neglected pipelines that no one wants to take care of.
Challenges You Will Face When Building Data Pipelines
Building data pipelines, especially batch pipelines that are tightly coupled to data sources, is not easy. I banged my head a lot when building SSIS integrations. If one field changed, the entire pipeline would freak out.
The truth is there are a lot of challenges you will face when building data pipelines and even more when building a modern data stack.
- Change in data formats over time.
- Increase in data velocity and volume.
- Rapid changes to data source credentials.
- Null issues.
- Change requests for new columns, dimensions, derivatives, and features.
- Writing source-specific code, which tends to create overhead for future maintenance of ETL flows.
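Several of these challenges, such as changing data formats and requests for new columns, boil down to schema drift. A cheap mitigation is to compare each batch’s columns against an expected contract before loading; the column names below are assumptions for illustration:

```python
# Assumed column contract for one pipeline. In practice this might
# live in a config file or a schema registry rather than in code.
EXPECTED_COLUMNS = {"order_id", "amount", "order_date"}

def detect_schema_drift(actual_columns):
    """Return (missing, unexpected) columns relative to the contract."""
    actual = set(actual_columns)
    missing = EXPECTED_COLUMNS - actual
    unexpected = actual - EXPECTED_COLUMNS
    return missing, unexpected

# A batch arrives with a dropped column and a brand-new one:
missing, unexpected = detect_schema_drift(
    ["order_id", "amount", "coupon_code"]
)
```

Failing fast on `missing` columns (and alerting on `unexpected` ones) turns a silent downstream breakage into a visible, attributable incident.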
Some of these challenges can be mitigated by solutions that offer connectors like Fivetran and Airbyte as they are tasked with keeping up with API changes. However, some of the other challenges such as increasing data size can be tricky depending on how you are pulling data.
The important thing to note is that these issues are almost all on the maintenance side of building pipelines.
Honestly, building pipelines has gotten considerably easier. The hard part comes with the fact that there are so many changing needs everywhere that everything from columns being removed, bad data being inserted, and ad-hoc requests being made can bog down a team.
This is why it is important to only create the pipelines you need and can manage. Each pipeline will inevitably create a few smaller tasks every so often.
Building Better Data Pipelines
However you build your data pipeline, whether it’s with code, low code, no code, or likely some combination of the three, it’s important that you understand the context around it. The tools you use matter, but more importantly, just building pipelines for the sake of building them isn’t a good idea.
Instead, data engineers should be asking questions like: What is the data for? Who will use it? How often do we need it? Most business users will just ask for data and figure they will work out what to do with it later.
So it is not uncommon for data engineers to be forced to drive the conversation in terms of “why.” Why do you need this data?
Article originally posted here by Ben Rogojan. Reposted with permission.