Data accessibility and availability are persistent sticking points within organizations that want to generate value from their data. Step 1. Get data. Step 2. ??? Step 3. Fancy models and profit! Right? **crickets** Yeah, there are some holes in this scheme, so let’s talk about that Step 2 bit for a moment and how schedulers can help.
Just a few years ago, companies of all sorts were trying to figure out how to collect data about their business. In many industries, this problem has been solved only too well! We now have data coming out of our ears! Petabytes here, terabytes there. But what is in this data, and how do we turn it into big bags of cash?
Ideally, all organizations should be making their data useful in some fashion. However, we frequently spend a bunch of time making sure we have stored our data securely, but don’t give sufficient thought to generating value from it. After all, why are you storing the data if not to use it later? Surely you don’t just like paying your data storage bill for fun!
This is among the problems that I’m suggesting you can solve with a good, robust scheduling tool and some scalable data storage. In my ODSC Europe talk, I’ll be demonstrating this idea with Apache Airflow and AWS data storage, specifically Redshift. However, the core principles are really not specific to the tools you pick; rather, the key is to conceptualize your data and your goals appropriately.
For example, let’s talk about the stakeholders in this problem for a company that wants to apply sophisticated machine learning to predicting a business outcome. Data management is one function, but data science and machine learning is another. Most organizations will have different people, in different roles/departments, handling these two elements.
Using a scheduled pipeline of data transfer allows you to link these two functions, and move your data where it needs to be when it is required. The two departments need only agree on some expectations regarding how data will be formatted upon transfer, e.g. which fields and types to expect, and then your pipeline can make sure data gets moved without any human intervention. The real linchpin is internalizing that safe, careful data mobility is just as important as the start and end points of this process.
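To make the "agree on expectations" part concrete, here’s a minimal sketch of what such a contract check might look like at the transfer boundary. The field names and types are hypothetical, not from any real pipeline; the point is just that the agreement is explicit and machine-checkable before data moves.

```python
# A hypothetical "contract" the two teams agree on: every record crossing
# the pipeline boundary must carry these fields with these types.
AGREED_SCHEMA = {
    "booking_id": str,
    "booked_at": str,     # e.g. an ISO 8601 timestamp, kept as text at the boundary
    "total_price": float,
}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected_type in AGREED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"wrong type for {field}: {type(record[field]).__name__}"
            )
    return problems

# A conforming record passes; a malformed one is flagged before transfer.
good = {"booking_id": "b-123", "booked_at": "2023-09-01T10:00:00Z", "total_price": 99.5}
bad = {"booking_id": "b-124", "total_price": "free"}
```

A check like this can run as its own task in the scheduled pipeline, so bad data is caught at the hand-off rather than discovered downstream by the modeling team.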
This is an alternative to some bad strategies, such as forcing both teams to use the exact same data storage solution. By doing this, you eliminate the need to move your data, sure, but you can count on no one getting the best tool for their actual needs. If one team’s primary concern is cheap, high-volume storage with minimal or no data manipulation, and the other team is worried about quick, user-friendly data querying and extraction with flexibility for writing and augmentation, you’re going to be in a bind choosing one single storage option. Maybe you can do it! But you don’t have to.
Another example is in machine learning itself. Much of what we do in modeling involves extracting data from someplace, engineering features tailored to the business outcome, and either training a model or making predictions using an existing model. Combining the scheduler + robust data store toolkit I’m proposing makes this really convenient, and as my talk title states, will enable you to make your modeling team much happier.
While your modelers can probably build bespoke Python pipelines for feature engineering, and run them with cron, this is a brittle and low-visibility method for managing your business-critical infrastructure. Especially given that model explainability and interpretability are crucial to really getting business value (if you can’t explain it, how are you going to sell the leadership on using it?), the visibility piece is key. A solid scheduler, like Airflow, can watch your back and make sure things are running as expected even when no one is watching. Set up whatever tests and checks are relevant to you, and make use of the readable logs, automatic alerts sent to your on-call system, and retries/error handling a scheduling tool can offer.
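To illustrate the contrast with plain cron, here is a toy, plain-Python version of the retry-and-log behavior a scheduler like Airflow gives you out of the box. This is a sketch for illustration only, not Airflow’s actual API; the function and parameter names are my own.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_with_retries(task, retries=3, delay_seconds=0):
    """Run a task, retrying on failure and logging each attempt --
    a toy version of what a scheduling tool provides for free."""
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, retries, exc)
            if attempt == retries:
                # In a real scheduler, this is the point where an alert
                # would fire to your on-call system.
                raise
            time.sleep(delay_seconds)
```

With cron, every line of this (and the alerting, and the log retention) is yours to build and maintain; with a scheduler, it’s configuration.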
Further, maintaining your engineered features so you can revisit them later is likely very valuable for debugging and testing. This requires writing data back into your data store as part of the job, another point where you need your scheduler and data storage to be working in sync. When you have a scalable data store that can swiftly consume the data that your scheduled jobs are creating, the task of maintaining those features is vastly simplified. Then your data store is accessible to data scientists for future testing, iteration, and model improvements.
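As a small sketch of the write-back idea: stamping each feature row with the run time means you can later reconstruct exactly what a model saw on a given run. The feature itself and all the field names here are made up for illustration.

```python
from datetime import datetime, timezone

def engineer_features(raw_rows):
    """Toy feature engineering step: derive one feature and stamp each
    row with the run time, so this exact run can be revisited later."""
    run_at = datetime.now(timezone.utc).isoformat()
    features = []
    for row in raw_rows:
        features.append({
            "trip_id": row["trip_id"],                      # hypothetical key
            "price_per_night": row["price"] / row["nights"],
            "feature_run_at": run_at,                       # audit trail for debugging
        })
    return features

# In a scheduled job, the final task would write these rows back to the
# data store (e.g. a warehouse table), not just feed them to a model.
```

Because the write-back is a task in the same scheduled pipeline, the features in your store are always the ones the model actually consumed, with no manual bookkeeping.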
These are only a handful of the ways that a scheduler + database infrastructure can be a huge boost for data science team architecture. While much of the glamor and exceptional business value in data science may be found in model building, getting the data where it needs to be at the right time, in a way your team can use, is table stakes for achieving that. If you’re interested in seeing the nuts and bolts of how my team at Journera applies Apache Airflow and AWS Redshift to carrying out this plan, please join me at ODSC Europe in September in my talk, “Making Happy Modelers: Build and Maintain Your Data Warehouse with AWS Redshift and Airflow.”
About the author/speaker: Stephanie Kirmer
Stephanie Kirmer is a Data Science Technical Lead at Journera, an early-stage startup that helps companies in the travel industry use data efficiently and securely to create better travel experiences for customers. Previously, she worked as a Senior Data Scientist at Uptake, where she developed tools for analyzing diesel engine fuel efficiency, and constructed predictive models for diagnosing and preventing mechanical failure. Before joining Uptake, she worked on data science for social policy research at the University of Chicago and taught sociology and health policy at DePaul University.