fbpx
What is Query Driven Data Modeling? What is Query Driven Data Modeling?
If you missed it, I had a live show a few weeks ago with one of the co-authors from Fundamentals Of Data... What is Query Driven Data Modeling?

If you missed it, I had a live show a few weeks ago with one of the co-authors from Fundamentals Of Data Engineering defined the concept of query-driven data modeling.

Now he was referencing the standard data modeling concepts which are:

  1. Conceptual
  2. Logical
  3. Physical
  4. but adding the 4th — Query-Driven-Modeling

According to Reis, many companies no longer perform the first two and often just go straight to developing queries to support a one-off request. Thus we are likely not even using any form of star schema or even asking ourselves whether we are building a data mart or data warehouse.

Instead, we are likely just building tables to support a single request. Or, once again, as Joe called it, Query-Driven-Modeling.

But what is query-driven modeling, and does it have a place in the data world?

Query-Driven-Modeling

Source: Author

Now, there isn’t an official definition for QDM, or JIT-data models, although we did in the past have the term ‘schema-on-read,” which is somewhat similar.

But I think many of us know what it is.

It’s the building of tables or data sets in reaction to a single stake-holder’s request. There may be some form of requirements gathering, but only for the one stakeholder.

Benefits of Query-Driven Modeling

Now, like any choice you make as an engineer, there are pros and cons.

  • Speed To Initial Insights — The clearest benefit of developing under a query-driven approach is time-to-insights (at least in the short term).
  • Self-Service — Tools like dbt have been a great door into data for many teams. They have also helped speed up the ability for analysts and those who are SQL proficient to go from query to table. In turn, this has also led teams to reach the “Holy Grail” of self-service considerably faster. When I first started in the data world, I recall reaching out to the EDW team with a query I had built and needed implementing. I had to wait 3–4 months to see it deployed…as a view. This wasn’t anything crazy, either. Many analytics teams likely can’t operate effectively in that environment and, in turn, will likely create a shadow IT team and data warehouse anyway.
  • Stakeholders Will Be Happy In The Short Run — Many data teams get stuck in loops of constantly developing JIT because stakeholders often face pressure to do so. Managers and directors need numbers to give to their VPs, VPs need numbers to give to the C-Suite, and the C-suite needs numbers to give to the board. And the constant downward pressure pushes down on newly minted analysts or over-worked data engineers who want to/are forced to deliver what their managers ask.

Thus, using query-driven modeling is the answer. You can query from raw data sets, run a few gut checks on the data, and boom.

Answers.

Everyone is happy.

Cons Of Query-Driven Modeling

There is always a cost for teams deciding to use a JIT approach to their data models. And to be clear, whether this is a conscious decision or not, it is made. There are clear trade-offs.

  • Decreased Agility When Changes Are Required — Although development time is faster, the ability to pivot and change what a system is doing will worsen as more models, dependencies, and tech debt are built up. Each new weak link in the complex system that is developed leads to a possible failure point that no one might be aware of.
  • Lack of Coherent Metrics — When data teams take on a JIT approach, there is a good chance that the same metrics will be developed by several teams using slightly different methodologies, leading to the classic issue where team numbers don’t match. Many companies I have seen, including FB, in turn, create some form of metrics layer or portal that would help define key metrics and call out where they are being used.
  • Spaghetti Pipelines — When JIT is heavily relied on, the data pipeline systems that are created can quickly become spaghetti pipelines. You’ll find yourself 18 DAGs in only to figure out that the customer_category field you thought came from source A was actually populated by Destination B and eventually made its way back to Source A (I have seen it).
  • Less Robust Data Sets — Besides the risk of self-dependency. Another issue is the fact that making even what might be perceived as a small change to one pipeline might have large consequences. Since no table is defined as core-per-say and governance is minimal, it’ll be hard to fully understand the weight of a change. Now we do create solutions for this, such as data lineage; but to some degree, if you’re overly relying on data lineage to track down how important a data set is, you likely already have an issue.
  • Cost — When I first started in the data world, my stakeholders told me they wanted real-time data for their dashboards. So I clicked “live data” on my Tableau dashboards and published them.

Well, as you can likely imagine, if you’ve been in the data world for even two or three years, it wasn’t long before I had two consultants come down and call this out as causing a decently large spike on the servers. The truth is tools like Tableau and dbt solve problems, but they can also cause new ones. They are so easy they can often increase not only the speed of insights but also the speed of bad decisions and computer usage, leading to greater costs.

With all those cons, does Query-Driven Modeling have a place in the data world?

Do Query-Driven Models Have Any Place In The Data World

I do think there are places for a more JIT approach to creating data sets. This is where the role of an Analytics Engineer really does make sense for some teams.

When companies have larger data teams and clear initiatives per department regarding how they want to use data, using a JIT approach combined with a more traditional core data model makes sense.

This was what we essentially had at Facebook.

The team I worked on created what I considered “Data As Infra.” That is to say the tables we built were relied upon by many other teams at Facebook. Other data engineers were building their insights and analytics based on what we built.

It got to the point where we wouldn’t be able to manually go through and understand the impact of making small changes. This meant we had to treat our data more like infrastructure, not just like a table you can change because of a single stakeholder’s request.

However, if you went one team away from us, you’d find a combination of very technical analysts or business-focused data engineers needing to create more one-off data sets supporting clear business asks (Now, this occasionally made it difficult to directly point to our impact since it was all so encompassing). But treating core tables more as infrastructure with policies and governance in place to ensure a small change doesn’t break all of the dependencies makes sense. Obviously, this is not a perfect model, and I have seen plenty of small organizations run fine with JIT modeling and others that use OBT, but none that really discuss conceptual modeling.

But generally speaking, you’ll eventually need to pay for the tech debt, whether that’s with a code re-write or continuing down the path and finding you reach a point where this meme is reality.

Source: Chad Sanderson

If you are in a place where you need help assessing your data infra or data strategy, then set up a consultation today!

This happens all the time in many modern data teams often because analysts and data scientists are forced to develop their own data pipelines. They generally focus on answering a specific set of requests.

Not developing robust data sets can support multiple use cases.

Design Is An Afterthought

Building data sets and workflows exactly when required with little to no requirements gathering has its benefits the same way extreme prototyping has its benefits. But for every benefit, there are tradeoffs.

Data sets become part of complex dependency trees, the business requests constant changes, and, in turn, the queries are constantly deteriorated to the point where they are likely maintaining far more in terms of usage weight than initially tested for.

Take, for example, a beam being put under new stresses in a building.

Eventually, it’ll collapse and likely take out several other beams with it.

Is query-driven modeling a type of data modeling? Sure.

But you better know the risks.

Article originally posted here. Reposted with permission.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.

1