7 Steps to Go From Data Science to Data Ops

Not too long ago, data operations wasn’t on the radar. Now that it’s all people talk about, how can you move efficiently from data science to data ops? Gil Benghiat, co-founder of DataKitchen, shares seven steps to do just that.

[Related Article: The Difference Between Data Scientists and Data Engineers]

Challenges of Data Science

Only a small fraction of the code you write is specific to machine learning. The challenge of moving from data science to full-scale operations lies in everything that surrounds the model. You consider the business need first, before building any kind of pipeline. You spend a great deal of time on data prep. Feature extraction requires another pipeline.

You get your model built, but now you have to test it for quality before deployment. Deployment itself is a critical challenge as well. The entire process is iterative, requiring feedback loops that continually look back to the original business need.

What is Data Ops?

Data ops involves two pieces: the people, process, and organization side, and the technical side. Data ops begins within the Agile environment, which primarily covers that first piece. The focus here is on value: providing data for your audience and context for the business.

Lean manufacturing and DevOps make up the technical environment. This is where the fundamentals of the seven steps lie. Going from data science to data ops requires a change in the way you think about your technical environment.

Seven Steps to Go From Data Science to Data Ops

So how do you logically go from simple data science to a full-scale data ops model that provides value for your business? Benghiat has a plan.

1. Orchestrate Two Journeys

The first journey is your data pipeline, which takes raw data and turns it into customer value. Each step moves or transforms the data into something more useful. Without that pipeline, your unstructured data is useless, and you can’t be sure of its quality.

The second journey puts new ideas into production. Your raw data informs your creation and innovation, and this journey carries each concept through the innovation pipeline into production.

2. Test Test Test

The software community automated production, going from periodic output to near-constant production. You test both of the journeys from step one, as data flows through the data pipeline and as ideas flow into production.

As data comes in, it should be free from issues, so that the business logic based on that data is sound. Your outputs should be consistent, and sanity tests confirm that each action from idea to production is working.

Testing data isn’t just a pass/fail process. Tests need to report more than one kind of result, such as error (which brings the line to a halt), warning (which flags potential issues), and info (which gives you a list of proposed changes). These tests could verify inputs or check business logic.
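The three levels of test result can be sketched in a few lines of Python. This is a minimal illustration, not Benghiat's implementation: the names `Severity`, `CheckResult`, `check_orders`, and `enforce`, and the `customer_id` field, are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"


@dataclass
class CheckResult:
    name: str
    severity: Severity
    message: str


class PipelineHalted(Exception):
    """Raised when an error-level check fails, stopping the line."""


def check_orders(rows):
    """Hypothetical input checks on a list of order dicts."""
    results = []
    if not rows:
        results.append(CheckResult("row_count", Severity.ERROR, "no rows received"))
    missing = sum(1 for row in rows if row.get("customer_id") is None)
    if missing:
        results.append(
            CheckResult("customer_id", Severity.WARNING, f"{missing} rows missing customer_id")
        )
    results.append(CheckResult("row_count", Severity.INFO, f"{len(rows)} rows received"))
    return results


def enforce(results):
    """Halt on any error; hand warnings and info back for reporting."""
    errors = [r for r in results if r.severity is Severity.ERROR]
    if errors:
        raise PipelineHalted("; ".join(r.message for r in errors))
    return [r for r in results if r.severity is not Severity.ERROR]
```

Calling `enforce(check_orders(rows))` between pipeline steps stops the run on an error while letting warnings and info pass through to a report.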

3. Use A Version Control System

You have a lot of tools, but at the end of the day, it’s all code. It needs a version control system to keep things together. Simple enough. Keeping a copy of each version maintains control as multiple members of the team are working and prevents accidental overwriting.

4. Branch And Merge

Once you’ve put your source control system in place, you can branch and merge. Each section branches off as you make changes and test. Once you know your section is good, you merge back into the whole system.

5. Use Multiple Environments

Analytic work requires coordinating many separate tools. Each branch needs its own environment, separate from the production pipeline, so that the data you develop against stays stable. If you don’t hold your data constant, it changes as production runs, which makes results confusing and can cause issues further down the pipeline.

Some companies use dev areas separate from QA, stage, and production. In that setup, each environment is static. Another option is spinning up an environment in the cloud.
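One way to picture separate environments is a small settings lookup keyed by environment name. The sketch below is an assumption for illustration only: the `PIPELINE_ENV` variable, the `Environment` fields, and the connection strings are hypothetical and not taken from the talk.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str
    database_url: str   # hypothetical connection string
    data_snapshot: str  # fixed snapshot so the data stays constant while you work


# Hypothetical static environments: dev, QA, stage, and production.
ENVIRONMENTS = {
    "dev": Environment("dev", "sqlite:///dev.db", "snapshot-dev"),
    "qa": Environment("qa", "sqlite:///qa.db", "snapshot-qa"),
    "stage": Environment("stage", "sqlite:///stage.db", "snapshot-stage"),
    "prod": Environment("prod", "sqlite:///prod.db", "live"),
}


def load_environment(name=None):
    """Pick settings by name, falling back to PIPELINE_ENV and then to dev."""
    name = name or os.environ.get("PIPELINE_ENV", "dev")
    try:
        return ENVIRONMENTS[name]
    except KeyError:
        raise ValueError(f"unknown environment {name!r}") from None
```

Because the pipeline code only ever asks `load_environment()` for its settings, the same branch can run against dev data first and production data later without being edited.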

6. Reuse And Containerize

Reusing code within data pipelines helps speed up production. The output of one pipeline can feed the input of the next project. Doing this requires knowing whether the data is fresh, but reusing code and data keeps things humming along without extracting data again when it isn’t necessary.
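The freshness check can be as simple as comparing a file's modification time to a maximum age. The following is a minimal sketch under that assumption; `is_fresh`, `get_extract`, and the one-hour default are hypothetical choices, and a real pipeline might track freshness in metadata instead.

```python
import time
from pathlib import Path


def is_fresh(path, max_age_seconds, now=None):
    """Return True if the file exists and was modified recently enough."""
    path = Path(path)
    if not path.exists():
        return False
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) <= max_age_seconds


def get_extract(path, extract, max_age_seconds=3600):
    """Reuse the saved extract when it is fresh; otherwise extract and save again."""
    path = Path(path)
    if is_fresh(path, max_age_seconds):
        return path.read_text()
    data = extract()  # the expensive step, only run when needed
    path.write_text(data)
    return data
```

A downstream project can call `get_extract` with the upstream pipeline's output path and only pay for a new extraction when the saved copy has gone stale.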

7. Parameterize Your Processing

Your pipeline is one big function. You can vary inputs and outputs through parameters, or use parameters to control the steps in the workflow. This increases efficiency and velocity, helping to move your data along into the production phase. For example, you can reuse the same code with different parameters and run pipelines in parallel, increasing the value your data brings into the production pipeline.
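To make "one big function" concrete, here is a small sketch in which parameters select the input slice and the steps that run, and the same code runs for several parameter sets in parallel. The function names, the `region` and `amount` fields, and the step names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(rows, region, min_amount=0.0, steps=("filter", "total")):
    """Hypothetical pipeline: parameters choose the input slice and which steps run."""
    data = [row for row in rows if row["region"] == region]
    if "filter" in steps:
        data = [row for row in data if row["amount"] >= min_amount]
    if "total" in steps:
        return {"region": region, "total": sum(row["amount"] for row in data)}
    return {"region": region, "rows": data}


def run_for_regions(rows, regions, **params):
    """Run the same pipeline code once per region, in parallel threads."""
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in the same order as `regions`.
        return list(pool.map(lambda region: run_pipeline(rows, region, **params), regions))
```

Threads are used here only to keep the example short; they suit I/O-bound steps, while CPU-heavy steps would likely call for separate processes or separate machines.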

[Related Article: Data Ops: Running ML Models in Production the Right Way]

Deploying The Seven Steps

So how does this fit into the original pipeline? Except for the original business need, which relates to Agile rather than the technical aspects of your pipeline, the seven steps map onto the stages of the original workflow.

Benghiat has years of experience building data ops through DataKitchen’s platform, and his seven steps grew out of building that platform so that he could use the tools he wanted. Automating your own data is the way to finally move to data ops and begin providing the kind of value your business or organization needs.


Elizabeth Wallace, ODSC

Elizabeth is a Nashville-based freelance writer with a soft spot for startups. She spent 13 years teaching language in higher ed and now helps startups and other organizations explain - clearly - what it is they do. Connect with her on LinkedIn here: https://www.linkedin.com/in/elizabethawallace/
