Are All Monoliths Bad?

Editor’s note: Elliott Cordo is a speaker for ODSC East this April 23-25. Be sure to check out his talk, “Data Pipeline Architecture – Stop Building Monoliths,” there!

The simple answer is no. When complexity is low and engineering teams are small, a monolith can be a great place to start for any software system. Under those same conditions, a more complex microservice implementation can be a pitfall, and could fairly be considered premature optimization.

However, at a certain level of complexity, monoliths start forming serious cracks in both system stability and productivity, especially when large, often federated teams are working on the same system.

Data Frustration

My career has been split almost equally between software and data engineering, although data has continually pulled me in. I absolutely love building data platforms, both for the technical challenge and for the intimacy of working with every nook and cranny of a business’s data and processes.

However, on the tech side of things, I’m often disappointed by the engineering maturity of the data platforms that are built. I’d say my beloved craft lags at least a decade behind more generalized software engineering. Yes, we have some new tools, which we largely consider to comprise the “Modern Data Stack.” However, the way most organizations use them, and to some extent the limitations of the tools themselves, has resulted in centralized, fragile, monolithic architectures.

Major platform components become huge single points of failure as they host large portions, or the entirety, of analytic processing. Developer experience and productivity degrade as the code base becomes so large that it is difficult to make changes. When developers do make changes, they risk merge or dependency conflicts, or introducing unintended bugs.

The pain level for an Airflow environment with 500+ DAGs can be roughly equivalent to that of a bloated Django project with a similar number of modules. It immediately raises the question: does this all need to be one thing?

Team Organization

At one time, the norm was for the data stack to be wholly owned by one central team. Due to the difficulties of scaling to large teams and systems, organizations started splitting up responsibilities vertically, i.e. ingest team, data lake team, data quality team, and serving team.

Although this allowed organizations to arrange themselves into more realistically sized teams and projects, it did not make work go any faster. We had now introduced cross-team dependencies and impedance for any data product outcome we wanted to drive. And the “smaller” platform components were still quite large (e.g. a single ingest project or data lake codebase).

Many organizations are now, in my opinion rightly, driving toward domain or outcome-based teams that, in addition to software development, are responsible for data concerns from both a product and an analytic perspective.

As many of you have probably heard, a concept called data mesh, which attempts to address various technical and organizational concerns, has been gaining traction.

Technology: “Modern Data Stack”

There are many variations, but most consider the modern data stack to be a combination of Airflow, DBT, and a Cloud “Data Warehouse” engine (which would include “Data Lake”). These tools themselves are not necessarily to blame for monolithic architecture, although features that support multi-project and federated development are still emerging.

Airflow

Large monolithic Airflow environments are quite common, and perhaps the most problematic. As a Python project, things can get bloated pretty quickly, and you can end up in dependency nightmares. The first question we should be asking ourselves is: does this have to be a single Airflow instance? Almost always the answer is no, as Airflow DAGs tend to be largely independent. A single instance is very often the result of inadequate investment in infrastructure as code (IaC) and CI/CD, which makes multiple deployments difficult.
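
One rough way to test whether DAGs really are independent is to group them by their cross-DAG dependencies. The following is only a minimal sketch in plain Python, not an Airflow feature: the DAG names are hypothetical, and it assumes you have already collected, for each DAG, the list of other DAGs it waits on (for example through sensors). Each resulting group has no dependency on any other group, so in principle it could live in its own Airflow deployment.

```python
from collections import defaultdict


def independent_groups(cross_dag_deps):
    """Group DAG ids so that no group depends on a DAG in another group.

    cross_dag_deps maps each DAG id to the DAG ids it waits on.
    """
    parent = {}

    def find(node):
        # Union-find lookup; registers unseen nodes as their own root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for dag_id, upstream_ids in cross_dag_deps.items():
        find(dag_id)  # make sure DAGs without dependencies are registered
        for upstream_id in upstream_ids:
            parent[find(dag_id)] = find(upstream_id)

    groups = defaultdict(set)
    for dag_id in parent:
        groups[find(dag_id)].add(dag_id)
    return sorted((sorted(group) for group in groups.values()), key=lambda g: g[0])


# Hypothetical inventory of DAGs and the DAGs they wait on.
example_deps = {
    "orders_ingest": [],
    "orders_marts": ["orders_ingest"],
    "marketing_ingest": [],
    "marketing_attribution": ["marketing_ingest"],
    "finance_close": [],
}
groups = independent_groups(example_deps)
for group in groups:
    print(group)
```

If most DAGs end up in small groups like these, that is a hint the environment could be split along those lines.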

As far as what you run inside of Airflow, there are plenty of options to keep the code base small and domain-specific. A really powerful tool is, of course, the KubernetesPodOperator, which allows you to remove code and logic from the Airflow environment and run it in a container instead.

See this post for additional details.
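
To make the container approach a little more concrete, here is a minimal sketch of a DAG whose only task runs in a pod. Everything specific in it is an assumption for illustration: the `orders` domain, the image name, the per-domain namespace, and the `pod_task_kwargs` helper are hypothetical. The import path assumes a recent `apache-airflow-providers-cncf-kubernetes` release and Airflow 2.4 or later, so check it against the versions you run.

```python
from datetime import datetime


def pod_task_kwargs(domain, job, image):
    """Build keyword arguments for one containerized task (hypothetical helper)."""
    return {
        "task_id": f"{domain}_{job}",
        "name": f"{domain}-{job}",      # pod name
        "namespace": f"data-{domain}",  # assumption: one namespace per domain
        "image": image,                 # the job's code and dependencies live here
        "cmds": ["python", "-m", f"{domain}.{job}"],
        "get_logs": True,               # stream container logs into the task log
    }


def build_dag():
    # Imported lazily so the helper above can be used without Airflow installed.
    # Older provider releases expose the operator from `operators.kubernetes_pod`.
    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

    with DAG(
        dag_id="orders_daily",
        start_date=datetime(2024, 1, 1),
        schedule=None,  # Airflow 2.4+; earlier versions use schedule_interval
        catchup=False,
    ) as dag:
        KubernetesPodOperator(
            **pod_task_kwargs("orders", "load", "registry.example.com/orders-jobs:1.0")
        )
    return dag


# In a real DAG file you would assign the result at module level, e.g.
# dag = build_dag(), so that the scheduler can discover it.
```

The point of the design is that the Airflow project only holds orchestration metadata, while the job's Python dependencies are pinned inside the image owned by the domain team.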

DBT & Data Warehouse Engine

Just like Airflow, DBT and the Data Warehouse engine almost always become monolithic. Unfortunately, this area is a bit pricklier, both in tooling and organizational practice. With some effort, however, DBT can be implemented with a multi-repo architecture, allowing separation along domain boundaries.
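
When splitting a project along domain boundaries, it helps to know where models already reach across them. The sketch below is a hedged illustration, not a DBT feature: it scans model SQL for the two-argument form `ref('package', 'model')` and reports references that point outside the current project. The project and model names are hypothetical, and the regular expression is deliberately simple (it ignores extra arguments such as a version).

```python
import re

# Matches {{ ref('model') }} and {{ ref('package', 'model') }}, either quote style.
REF_PATTERN = re.compile(
    r"""\{\{\s*ref\(\s*['"]([^'"]+)['"]\s*(?:,\s*['"]([^'"]+)['"]\s*)?\)\s*\}\}"""
)


def cross_project_refs(sql, own_project):
    """Return (package, model) pairs that point outside own_project."""
    found = []
    for first, second in REF_PATTERN.findall(sql):
        # Only the two-argument form names a package explicitly.
        if second and first != own_project:
            found.append((first, second))
    return found


# Hypothetical model from an "orders_domain" project.
model_sql = """
select o.order_id, c.segment
from {{ ref('stg_orders') }} o
join {{ ref('customer_domain', 'dim_customers') }} c using (customer_id)
"""
external = cross_project_refs(model_sql, "orders_domain")
print(external)
```

A list like this gives each domain team a starting inventory of the interfaces it would need to agree on before the repositories are separated.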

The Data Warehouse engine itself tends by default to be a single environment. However, most platforms do enable separation of storage and processing, as well as multi-warehouse architecture with data sharing. It’s more a failure of planning and data governance (e.g. security, data contracts) than a limitation of technology.

Learning More About Monoliths at ODSC East 2024

I hope these ideas and tips are helpful. I look forward to diving deeper at my upcoming ODSC talk, “Stop Building Monoliths.”

About the Author

Elliott is an expert in data engineering, data warehousing, information management, and technology innovation with a passion for helping transform data into powerful information. He has more than a decade of experience implementing cutting-edge, data-driven applications. He has a passion for helping organizations understand the true potential in their data by working as a leader, architect, and hands-on contributor.

Elliott has built nearly a dozen cloud-native data platforms on AWS, ranging from data warehouses and data lakes, to real-time activation platforms in companies ranging from small startups to large enterprises.
