7 Steps to Go From Data Science to Data Ops
Posted by Elizabeth Wallace, ODSC, March 19, 2019
Not too long ago, data ops wasn’t on the radar. Now that it’s all anyone talks about, how can you move efficiently from data science to data ops? Gil Benghiat, co-founder of Data Kitchen, shares seven steps to do just that.
Challenges of Data Science
Only a small fraction of your coding is specific to machine learning. The challenge of moving from data science to full-scale operations is all the other things around machine learning. You consider the business needs first and foremost before building any kind of pipeline. You spend a ton of time in data prep. Feature extraction requires another pipeline.
You get your model built, but now you have to test it for quality before deployment. Deployment itself is a critical challenge as well. The entire process is iterative, requiring feedback loops that continually look back to the original business need.
What is Data Ops?
Data ops has two pieces: the people, process, and organization side, and the technical side. It begins with Agile, which is primarily that first piece. Here you focus on value, delivering data to your audience and providing context for the business.
Lean manufacturing and DevOps make up the technical environment. This is where the fundamentals of our seven steps lie. Going from data science to data operations requires a change in the way you think about your technical environment.
Seven Steps to Go From Data Science to Data Ops
So how do you logically go from simple data science to a full-scale data ops model that provides value for your business? Benghiat has a plan.
1. Orchestrate Two Journeys
The first journey is your data pipeline, which takes raw data and turns it into customer value. Each step moves the data or transforms it into something more useful. Without that pipeline, raw, unstructured data is of little use, and you can’t be sure of its quality.
The second journey puts new ideas into production. Your raw data informs what you create, and this journey carries each concept through the innovation pipeline into production.
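As a rough illustration of the first journey, here is a minimal Python sketch of a pipeline as an ordered list of steps. The step names (`drop_incomplete`, `add_amount_cents`) and the row format are hypothetical, chosen only for this example, and are not taken from Benghiat's talk.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def drop_incomplete(rows: List[Row]) -> List[Row]:
    """Hypothetical cleaning step: discard rows with no amount."""
    return [r for r in rows if r.get("amount") is not None]


def add_amount_cents(rows: List[Row]) -> List[Row]:
    """Hypothetical transform step: derive a new column from an existing one."""
    return [{**r, "amount_cents": round(r["amount"] * 100)} for r in rows]


def run_pipeline(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Orchestrate the journey: feed each step's output into the next step."""
    for step in steps:
        rows = step(rows)
    return rows


raw = [{"amount": 10.0}, {"amount": None}]
result = run_pipeline(raw, [drop_incomplete, add_amount_cents])
```

Keeping the steps in a plain list makes the order of the journey explicit and easy to change.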
2. Test, Test, Test
The software community automated testing and deployment, going from periodic releases to near-constant production. Do the same here by testing both journeys from step one: data as it flows through the data pipeline, and ideas as they flow into production.
Incoming data should be free from issues so that the business logic built on it is sound. Outputs should be consistent from run to run. Sanity tests at each stage confirm that every action, from idea to production, is working.
Testing data isn’t just a pass/fail process. Testing needs to be capable of multiple types of indication, such as error (which brings the line to a halt), warning (which flags potential issues), and info (which gives you a list of proposed changes). These could be verifying inputs or testing business logic.
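The three levels described above could be sketched in Python along these lines. This is only an illustration under assumed names: `Severity`, `CheckResult`, `check`, `evaluate`, and `PipelineHalted` are hypothetical, and here the info level simply reports a note rather than proposing changes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"


@dataclass
class CheckResult:
    name: str
    severity: Severity
    passed: bool


class PipelineHalted(Exception):
    """Raised when an error-level test fails, stopping the line."""


def check(name: str, severity: Severity, condition: bool) -> CheckResult:
    """Record the outcome of one test together with its severity."""
    return CheckResult(name, severity, bool(condition))


def evaluate(results: List[CheckResult]) -> List[str]:
    """Return messages for failed warning/info tests; halt on any failed error."""
    failed = [r for r in results if not r.passed]
    errors = [r.name for r in failed if r.severity is Severity.ERROR]
    if errors:
        raise PipelineHalted("error: " + "; ".join(errors))
    return [f"{r.severity.value}: {r.name}" for r in failed]


rows = [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": -5.0}]
results = [
    check("input is not empty", Severity.ERROR, len(rows) > 0),  # verifies inputs
    check("no negative amounts", Severity.WARNING,
          all(r["amount"] >= 0 for r in rows)),                  # business logic
    check("row count at least 100", Severity.INFO, len(rows) >= 100),
]
messages = evaluate(results)
```

With this sample data the run continues, and `messages` carries the failed warning and info tests for someone to review.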
3. Use A Version Control System
You have a lot of tools, but at the end of the day, it’s all code. It needs a version control system to keep things together. Simple enough. Keeping a copy of each version maintains control as multiple members of the team are working and prevents accidental overwriting.
4. Branch And Merge
Once your version control system is in place, you can branch and merge. Each piece of work branches off so you can make changes and test them in isolation. Once you know your branch is good, you merge it back into the whole system.
5. Use Multiple Environments
Analytic work requires coordinating many separate tools. Each branch needs its own environment, separate from the production pipeline, with its own fixed copy of the data. If you don’t hold the data constant, it keeps changing as production runs, which makes results confusing and can cause issues further down the pipeline.
Some companies keep dev areas separate from QA, staging, and production. In that setup, each environment is static. Another option is to spin up an environment in the cloud when you need one.
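One way to keep environments apart in code is to select all environment-specific settings by name. The sketch below assumes a hypothetical `PIPELINE_ENV` variable and made-up database URLs and output directories. It shows the idea, not a recommended layout.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Environment:
    name: str
    database_url: str  # hypothetical connection string
    output_dir: str    # hypothetical location for pipeline outputs


# Hypothetical settings: every environment gets its own data and outputs.
ENVIRONMENTS = {
    "dev": Environment("dev", "sqlite:///dev.db", "out/dev"),
    "qa": Environment("qa", "sqlite:///qa.db", "out/qa"),
    "stage": Environment("stage", "sqlite:///stage.db", "out/stage"),
    "production": Environment("production", "sqlite:///prod.db", "out/production"),
}


def get_environment(name: Optional[str] = None) -> Environment:
    """Pick an environment by name, falling back to PIPELINE_ENV, then dev."""
    name = name or os.environ.get("PIPELINE_ENV", "dev")
    if name not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {name!r}")
    return ENVIRONMENTS[name]


qa = get_environment("qa")
```

Because the pipeline code only ever sees an `Environment` object, the same code can run against dev data on a branch and against production data after a merge.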
6. Reuse And Containerize
Reusing code within data pipelines speeds up production, and the output of one pipeline can feed the input of the next project. Containerizing a step, meaning packaging its code together with its dependencies, makes it easier to reuse elsewhere. You still need to know whether the data is fresh, but reusing code and data keeps things humming along without re-extracting data when it isn’t necessary.
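The freshness check can be as simple as comparing a cached file's age against a limit. The following Python sketch assumes a JSON file cache and a hypothetical `load_or_extract` helper. A real pipeline might track freshness in other ways.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Callable


def load_or_extract(cache_path: Path, extract: Callable[[], list],
                    max_age_seconds: float) -> list:
    """Reuse cached data if it is fresh enough; otherwise extract it again."""
    if cache_path.exists():
        age = time.time() - cache_path.stat().st_mtime
        if age <= max_age_seconds:
            return json.loads(cache_path.read_text())
    data = extract()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(data))
    return data


calls = []


def fake_extract() -> list:
    """Stand-in for an expensive extraction; records each call."""
    calls.append(1)
    return [{"id": 1}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "orders.json"
    first = load_or_extract(cache, fake_extract, max_age_seconds=3600)
    second = load_or_extract(cache, fake_extract, max_age_seconds=3600)
```

The second call reads the cached file instead of running the extraction again, so `fake_extract` runs only once.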
7. Parameterize Your Processing
Think of your pipeline as one big function. Parameters let you vary its inputs and outputs or control which steps in the workflow run. That increases efficiency and velocity, helping you move data into production. For example, you can reuse the same code with different parameters and run the variations in parallel, increasing the value your data brings to the production pipeline.
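Here is a small Python sketch of the pipeline-as-function idea, run with two parameter sets in parallel. The function `run_sales_pipeline`, its `region` and `min_amount` parameters, and the sample rows are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

ROWS = [
    {"region": "east", "amount": 5},
    {"region": "east", "amount": 20},
    {"region": "west", "amount": 15},
    {"region": "west", "amount": 8},
]


def run_sales_pipeline(rows: List[Dict], region: str, min_amount: int = 0) -> Dict:
    """Same code for every run; only the parameters change."""
    selected = [r for r in rows
                if r["region"] == region and r["amount"] >= min_amount]
    return {
        "region": region,
        "total": sum(r["amount"] for r in selected),
        "count": len(selected),
    }


parameter_sets = [
    {"region": "east", "min_amount": 0},
    {"region": "west", "min_amount": 10},
]

# Run one pipeline per parameter set; map() returns results in input order.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda p: run_sales_pipeline(ROWS, **p), parameter_sets))
```

Adding another variation then means adding another entry to `parameter_sets`, not another copy of the pipeline.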
Deploying The Seven Steps
So how does this fit into the original pipeline? Except for the original business need, which belongs to the Agile side rather than the technical side of your pipeline, the seven steps map onto the stages of the original workflow.
Benghiat has years of experience building data ops through Data Kitchen’s platform, and his seven steps grew out of building that platform so that he could use the tools he wanted. Automating your own data pipelines is the way to finally move to data ops and begin providing the kind of value your business or organization needs.