fbpx
The Future of the Single Source of Truth is an Open Data Lake The Future of the Single Source of Truth is an Open Data Lake
Editor’s note: Christina is a speaker for ODSC East 2024 this April 23-25. Be sure to check out her talk, “Dive... The Future of the Single Source of Truth is an Open Data Lake

Editor’s note: Christina is a speaker for ODSC East 2024 this April 23-25. Be sure to check out her talk, “Dive into Data: The Future of the Single Source of Truth is an Open Data Lake,” there!

Introduction: The Rise of ELT

Organizations that strive for high-performance data systems are increasingly turning towards the ELT (Extract, Load, Transform) model. This approach provides scalability, flexibility, and low latency: ELT allows for horizontal scaling of the target data store and enables users to perform analysis and transformations on the fly. However, choosing the right provider and target can be a daunting task, as it carries high stakes and high rewards. A successful implementation can deliver significantly faster insights, but a poorly executed one can result in costly infrastructure maintenance and vendor lock-in.

The Proliferation of EL Technology

Are you torn between building and buying software for your EL needs? Join the club. With limited engineering resources and a tight GTM timeline, my team started with a vendor. It helped us scale up to hundreds of SAAS connectors quickly, but the cost model and lack of customization became problematic over time. (I wrote about our learning curve here.) We decided to explore other options for our ingestion needs Airbyte and Meltano. I made a side-by-side comparison in the appendix.

I am not going to recommend the “best“ technology to you because there is no silver bullet. I do want to introduce a framework that can help you evaluate your opinions quantitatively, instead of making decisions based on what is popular right now. It is called Analytic Hierarchy Process, and it was a game-changer for my team.

Remember, developers are a key part of this equation. So don’t forget to consider their satisfaction with the technology. After all, happy developers deliver for successful projects and maintain the stack for the long term. 

The One True Destination

With the ability to accommodate both structured and unstructured data, data lakes are the perfect fit for a wide range of analytics, machine learning, and product use cases. Modern data lakehouses take full advantage of the infinite scalability of cloud storage, while also offering a range of advanced features such as ACID transactions, time travel, and schema evolution.  An open data lake, rather than proprietary targets, is the inevitable EL destination. By choosing open formats, you can dramatically reduce costs, prevent vendor lock-in, have control over your data, and freedom to innovate. By decoupling compute from storage, you can substantially improve data pipeline efficiency. (I shared my experience in this post.)

Be aware that an EL framework’s write-to-target is often overly general and inefficient. This is because the primary objective of EL is to deliver bronze data, not to optimize the user’s compute resources. Although it is possible to make the process more efficient, doing so requires additional infrastructure. Unfortunately, this is not always transparent to the user, especially when the EL provider is a closed-source service. In such cases, the user has to trust the vendor’s autoscaling mechanism, which may not always work optimally. 

Here is an example from Airbyte’s Redshift connector documentation. The default – insert data into columnar storage as row operations – is less optimal. 

The Redshift Destination offers two replication strategies. The first inserts via a typical SQL Insert statement. Although less efficient, this requires less user setup. 

The second inserts via streaming the data to an S3 bucket, and Copy-ing the data into Redshift. This is more efficient, and recommended for production workloads, but does require users to set up an S3 bucket and pass in additional credentials.

It is often beneficial to separate the process of writing to a target from data extraction, by using cloud storage as the initial landing zone. You can create a data extractor (e.g. a REST API service) that asynchronously pulls data from the source in increments. The service then stores the data in its raw format in S3, without performing heavy-duty data transformation. Instead, complex transformation tasks are delegated downstream to the silver/gold layers.

The Open Format Contest

Open table formats such as Delta, Iceberg, and Hudi have grown tremendously in popularity. All of them use Parquet, an open-source, high-performance, columnar file format, as the underlying storage layer. However, the metastores, which store data type, structure, and statistical information, are still vendor-specific. Using Delta on one of the most advanced enterprise data warehouses is still unreliable and considered experimental; the best-managed Spark compute platform does not yet natively support Iceberg. The battle continues.

Although overshadowed by challenges, hope finds a way: One promising step forward is the Uniform Project. Uniform enables users to work with open tables and access multiple clients, regardless of format. However, it is important to note that there are limitations to the current reader/writer versions, which may result in performance issues. I am excited about the development of these kinds of projects and welcome contributions from interested individuals.

Conclusion: The Hope for Future Data Lakes

We love modern EL technology and its ability to offer greater flexibility and faster insights. However, I believe that the productivity of these tools relies heavily on the skills of the developers who maintain them. From my perspective, cloud storage is the most reliable EL destination since it eliminates the need for inefficient transformations, vendor-specific “black boxes,” and proprietary targets. I foresee data science and analytics moving towards high-throughput, low-latency open tables. I look forward to a future where open format wars are a thing of the past and metastores are unified.

Would you like to join me in a dialogue on this topic at Open Data Science East 2024?

Appendix – ELT Choices Comparison

Vendor Airbyte (self-host) Meltano (self-host)
Origin Closed source Airbyte (2020) GitLab (2018)
Source Several hundreds of connectors and growing Hundreds of connectors, custom connector development kit Hundreds of connectors, customer REST API, and tap development kit
Target Support S3 (limited formats), data lake (requires metastore), data warehouses (requires separate compute) Support S3 in Avro, Parquet, Json Line. Response is converted multiple times. Multiple compaction strategies. Support S3 in Json and Parquet formats. Automatic compaction.
Ease of Use Easy to get started, restricted customization Easy to get started, has an UI to help iterate through configuration Steeper learning curve. Once learned, easier to overwrite classes and methods.
Customization Subject to product roadmap Low code SDK, SDK, UI
Core is written in Java
Custom SDK
Written in Python
Speed Depending on target compute Can be slow; read source code for implementation and data type conversion. Faster; write your own code for more efficiency.
Transformation Separate setup with formulas or dbt core/cloud With limitations With limitations
Support Paid support plan Active community Slack Active community slack

About the Author:

Christina is passionate about open source, multi-cloud, scalable and efficient data pipelines. She makes data informed architectural decisions to build modern systems that support advanced analytics, machine learning, and customer facing product use cases. She has a keen interest in interdisciplinary areas such as DevOps, MLOps and Cloud FinOps. She loves to learn, share and contribute to the open source community.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.

1