Editor’s note: Check out Ido’s talk at ODSC East 2019 this May, “From Zero to Airflow: Bootstrapping Into a Best-in-Class Risk Analytics Platform.”
The tipping point
Many organizations reach a point at which new goals for SLA, scale, or efficiency simply exceed the capabilities of their existing data and ML infrastructure. From a data science perspective, this usually means a degraded ability to manage and monitor existing models in production, as well as a reduced capacity to address the core needs of the business.
Selecting a new infrastructure model that will allow the business to meet its goals then becomes a top concern for everyone involved. However, it turns out to be a difficult space to navigate, especially given the myriad of options out there. Open source or commercial? Batch or streaming? Fully managed or self-hosted? While these are clearly important considerations, others are more useful for orienting the space around the ability to run ML models in near-real time or real time:
- What runtime or latency can each solution deliver?
- Which solution is the easiest to implement and transition into, given the existing technological stack?
- Which solution would be the easiest for data science, data engineering, DevOps, or any other affected team to use and maintain daily?
Some popular solutions
Cronjobs
At its simplest, this is a bash script or scheduler entry that executes commands at a preset frequency. This solution has simplicity going for it: the mechanism is easy to understand and implement and has virtually no overhead. However, it becomes difficult to manage as the number of tasks increases or as tasks become interdependent, since there is no obvious way to externalize dependencies or monitor them. There is also no native interface to enable use by those who are less tech-savvy, and tasks cannot be run at a frequency higher than once per minute (unless you are open to some creative trickery). Cronjobs are naturally geared for batch operations.
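As a sketch of that "creative trickery": since cron's floor is one minute, a common workaround is to have cron invoke a wrapper once per minute, and let the wrapper run the task several times itself. The crontab line and helper names below are hypothetical, not from any specific setup:

```python
import time

# Hypothetical crontab entry -- cron fires this wrapper once per minute:
#   * * * * * /usr/bin/python3 /opt/jobs/run_task.py

def run_sub_minute(task, times_per_minute=2, interval=None):
    """Work around cron's one-minute floor: invoked once per minute by
    cron, this runs `task` several times with a fixed pause in between."""
    if interval is None:
        interval = 60 / times_per_minute
    for _ in range(times_per_minute):
        task()
        time.sleep(interval)
```

The obvious downsides are visible even in this toy: overlapping runs, drift, and failures are all the caller's problem, which is exactly where cron starts to break down.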
Workflow managers
These are essentially frameworks both for organizing tasks into networks of dependencies and for executing those networks. The immediate gain is that tasks with complex structures are much easier to understand, monitor, and execute at scale. Despite a learning curve, these frameworks are usually relatively user-friendly and support many different levels of engagement. On the other hand, they do require significant overhead: repositories for managing the workflow code, a server and database for running the scheduler, and yet more servers to act as workers/executors for the tasks themselves. Typically, the supported latencies are on the order of seconds or minutes. Two popular open-source, self-hosted solutions are Apache Airflow and Luigi; both are geared for batch operations and can integrate relatively well with an existing batch-oriented tech stack.
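To make the "networks of dependencies" idea concrete, here is a minimal stdlib-only sketch; the task names are invented for illustration. Frameworks like Airflow and Luigi build scheduling, retries, backfills, and monitoring on top of exactly this core: a directed acyclic graph of tasks executed in dependency order.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# A toy dependency network: each task lists the tasks it depends on.
# (Illustrative names only -- not from a real pipeline.)
DAG = {
    "extract": set(),
    "clean": {"extract"},
    "score": {"clean"},
    "report": {"score", "clean"},
}

def run_dag(dag, tasks):
    """Execute callables in an order that respects their dependencies,
    passing each one the results of the tasks that ran before it."""
    order = list(TopologicalSorter(dag).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name](results)
    return order, results
```

An Airflow DAG file expresses the same graph declaratively; the scheduler, not your code, decides when each node actually runs and on which worker.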
Streaming platforms
Instead of consuming broad cross-sections of data at a given frequency, streaming platforms allow data to be consumed continuously via rolling time windows, virtually upon arrival. The key concept is that data is consumed and analyzed before it is stored or transformed, with the storage element repurposed mostly for overflow and other secondary applications. The classic model begins with a stream manager that publishes incoming data streams and allows other elements to subscribe to them (e.g. Apache Kafka). From there, streams are consumed by various processors that are optimized for distributed operations and high data volumes (e.g. Apache Flink / Storm, Amazon Kinesis, MapR, Spark Streaming). Some really impressive latencies can be achieved this way, on the order of seconds or even milliseconds, but for many organizations this kind of switch is nothing short of a paradigm shift.
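The rolling-time-window idea can be illustrated with a toy, single-process stand-in for what Kafka plus a distributed processor would do at scale (the class and the window size are illustrative assumptions, not any platform's API):

```python
from collections import deque

class RollingWindow:
    """Consume events as they arrive and aggregate over a rolling time
    window -- data is analyzed on ingestion, before anything is stored."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first

    def ingest(self, timestamp, value):
        self.events.append((timestamp, value))
        # Evict events that have aged out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window:
            self.events.popleft()

    def aggregate(self):
        # Any streaming aggregate works here; a sum keeps the sketch simple.
        return sum(v for _, v in self.events)
```

A real stream processor distributes this state across partitions and handles out-of-order events and fault tolerance, which is where most of the paradigm shift lies.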
A real-world case study
At BlueVine, we were faced with this exact dilemma. We knew that we wanted to achieve a response time of a few minutes while also allowing for robust organization, monitoring, and control of every task. The challenge was further compounded by the wide range of tasks to manage: API communication with dozens of data vendors, data transformation processes, hundreds of low-level calculations, and multiple layers of interconnected models. We also wanted to ensure that the transition from a system of cronjobs would not cause any downtime and would not negatively impact the division of labor and responsibilities between the various data teams. These considerations eventually led us to choose Apache Airflow as our new platform, which we fully implemented in mid-2018.
Learn more about our transition process, the problems we faced, the effect it had on the various data teams, and some real-world insights about the strengths and limitations of Airflow during my presentation at ODSC East 2019, “From zero to Airflow: Bootstrapping into a best-in-class risk analytics platform.”