Reports like the Standish Group’s CHAOS paper point to the persistently abysmal failure rate of software projects. The exact rates vary by report, of course, but all paint the same bleak picture: most software projects fail by some critical measure, whether time, budget, quality, or requirements.
In the context of data science and AI projects, can we assume the success rate will be any higher for production projects? Ignoring the fact that various types of data science projects from statistical analysis to deep learning are completely different, a quick and dirty breakdown of the typical data science project structure will give us some insight:
- Identifying and sourcing data from disparate data sources
- Cleaning, transforming, and wrangling data into usable training, streaming, and modeling data, plus other preparation
- Training, feature selection, model selection, and model performance
- Infrastructure, scalability, repeatability, performance degradation, reporting, etc.
Generally speaking, the data science stack is more complex than a typical software stack. That is not a good sign for project success.
Let’s look at the human side of the equation in terms of collaboration and communication. Typically we need the following on a team:
- Data scientists
- Data architects
- Software architects
- Software developers
- Database/lake admins
- Project managers
- Operations managers
And these are only a few of the roles.
Given the complexity of production-quality data science projects, can we expect an equal or better rate of success? Not likely, unless the way we approach data science projects changes.
The software profession has partly solved this with DevOps. At its core, DevOps requires close and agile coordination and collaboration between PMs, development, QA, operations teams, and others.
What about data science? DataOps has been championed in the big data community for a few years now but is less well known in the data science/AI community. DataOps for data science borrows the key elements from DevOps – close coordination between data scientists, developers, operations, infrastructure and anyone else involved in the build, deploy, repeat cycles of a project.
The use case is simple. Any deep learning, machine learning, or predictive model requires constant monitoring and attention. Model performance weakens over time, data quality and data load vary, and hot swaps may be required. All of these require coordination among teams to get anything close to a well-performing production environment. The data science profession is maturing as projects become more critical to the enterprise, and new roles like DataOps will emerge.
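To make the monitoring point concrete, here is a minimal Python sketch of one way to watch a deployed model for weakening performance. The class name `RollingAccuracyMonitor`, the window size, and the tolerance are all hypothetical choices for illustration, not a standard tool or any particular team's method. It assumes ground-truth labels eventually arrive for each prediction.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Hypothetical helper: tracks accuracy over the most recent
    predictions and flags a drop below a baseline."""

    def __init__(self, baseline_accuracy, window_size=100, tolerance=0.05):
        self.baseline_accuracy = baseline_accuracy  # accuracy measured at deployment
        self.tolerance = tolerance                  # allowed drop before alerting
        self.outcomes = deque(maxlen=window_size)   # True/False per recent prediction

    def record(self, prediction, actual):
        # Store whether the prediction matched the observed outcome.
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        # Accuracy over the current window, or None if nothing is recorded yet.
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_degraded(self):
        # Only judge once the window is full, to avoid noisy early alerts.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.baseline_accuracy - self.tolerance


if __name__ == "__main__":
    monitor = RollingAccuracyMonitor(baseline_accuracy=0.9, window_size=10)
    for _ in range(10):
        monitor.record(prediction=1, actual=1)  # ten correct predictions
    for _ in range(5):
        monitor.record(prediction=1, actual=0)  # five misses push out old hits
    if monitor.is_degraded():
        print("Accuracy dropped; time to coordinate a retrain or hot swap.")
```

In practice the alert would go to whichever team owns retraining or redeployment, which is exactly the kind of cross-team handoff DataOps is meant to smooth.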
This year’s Open Data Science Conference (ODSC) will bring over 4,000 data scientists, developers, and data wranglers to Boston on May 1st, 2018 for four days. As an applied data science conference for professionals, we understand the importance and challenges of production-quality data science. This year we have a new focus area titled “Management, Practice, and DataOps in Data Science.” Some relevant talks include:
- Creating a Data-Driven Product Culture
- How to Go From Data Science to Data Operations
- Standardized Data Science: The Team Data Science Data Process
- Monitoring AI Applications with AI
- Data Science State of the Union, 2018
Join us for other focus areas including Deep Learning, Machine Learning, Data Visualization, Data Science Research, Business and Innovation, and others.