Should You Build or Buy Your Data Science Platform?
An increasingly common theme among vendors exhibiting at conferences for data scientists is the AutoML data science platform. It seems like every company has its own slant on how to automate the machine learning process. There are some impressive and innovative products out there, and I reviewed the state of the industry back on January 31, 2019. In this article, we’ll update that review and pose an important question on the minds of many data scientists: “Should I use an auto ML platform, should I build my own, or should I just keep on doing machine learning like I’ve always done it?”

 

Auto ML Platforms Abound

I’ve taken some time to survey the companies in the big data ecosystem and compile an extensive list of the most prevalent data science & ML platforms that offer some sort of automation. Each one has a somewhat different take on what data scientists need to help them in their work. The common theme is to stretch the skills of the data scientist to address the shortage of data science expertise.

| Auto ML/Data Science Platform | Description |
| --- | --- |
| BigML | BigML is a consumable, programmable, and scalable machine learning platform that makes it easy to solve and automate common ML algorithms. |
| Binah.AI | Binah delivers complex AI-based data science expertise. |
| Determined AI | AutoML at scale. Speed up model development by 100x via distributed training and best-in-class hyperparameter search. |
| Domino Data Lab | Domino provides an open, unified platform to build, validate, deliver, and monitor models at scale. |
| dotData | End-to-end data science automation platform allowing you to automate your entire data science process. |
| Dotscience | Dotscience makes data science teams more productive by enabling collaboration, flexible access to high performance compute, and version control. |
| Dataiku | From prototyping ML-based pipelines to deploying scalable AI services across the enterprise. Save time with data access, data pre-processing, feature engineering, and model training and testing. |
| DataRobot | Enabling the AI-driven enterprise with automated machine learning. |
| Google Cloud AutoML | Beta. Train high-quality custom machine learning models with minimum effort and machine learning expertise. AutoML translation, natural language, and vision. |
| Gramener | Gramex is a no-code data science platform to automate insights as stories. Generate insights using machine learning. |
| H2O | H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit. |
| Iguazio | Data science and analytics PaaS. Accelerating data science from exploration to production. |
| Imaginea AI | The Imaginea Artificial Intelligence Ecosystem is a scalable platform designed to put precision AI technologies in the hands of every organization across the globe. |
| John Snow Labs | AI platform. |
| KNIME Software | KNIME Analytics Platform is the open source software for creating data science applications and services. |
| Logical Clocks | Hopsworks is a Data and Compute Platform for AI. It is a Python-first platform that supports the design and operation of end-to-end machine learning (ML) pipelines, written fully in Python. |
| MissingLink | MissingLink helps data engineers streamline and automate the entire deep learning lifecycle. |
| Periscope Data | The Periscope Data platform gives data professionals full control over the analytics lifecycle, including ingestion, storage, analysis, visualization, and reporting. |
| R2.ai | R2 Learn is an AutoML product that enables enterprises of all sizes to have ML development capabilities. |
| RapidMiner | RapidMiner is a software platform for analytics teams that unites data prep, machine learning, and predictive model deployment. |
| SigOpt | Black-box hyperparameter optimization solution automates model tuning to accelerate the model development process and amplify the impact of models in production at scale. |
| TROVE | TROVE offers data wrangling and curation to help companies unlock data and apply sophisticated AI models to make data-driven decisions that deliver measurable business results. |
| World Programming | WPS Analytics is a powerful and versatile software platform for scalable data manipulation and analytics. |

Tradeoffs for Building Your Own?

Deciding whether to adopt the methodology and overarching philosophy of a commercial auto ML tool requires much consideration. This is a strategic business decision and more often than not, the decision maker is going to be the Chief Data Scientist (or equivalent) of the organization.

Many data scientists will consider building a custom auto ML solution from scratch. Data scientists like to build things, so it’s natural that they might take a stab at automating the tried-and-proven methods they already use. This is not to say such a solution will compete with a product developed by a well-funded start-up, but it may be satisfactory for the problems at hand.
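
As a rough illustration of what the simplest home-grown version might look like, the sketch below cross-validates a handful of candidate models and keeps the one with the lowest error. It uses only the Python standard library. The candidate models, the `cross_val_mse` and `auto_select` helper names, and the toy data are all assumptions made for illustration; a real solution would more likely wrap a library such as scikit-learn.

```python
from statistics import mean

# Hypothetical toy data: y is roughly 2*x + 1 (illustrative only).
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Y = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.9, 15.2]


def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    avg = mean(ys)
    return lambda x: avg


def fit_linear(xs, ys):
    """Ordinary least squares for a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """1-nearest-neighbour regression."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def cross_val_mse(fit, xs, ys, k=4):
    """Mean squared error averaged over k interleaved folds."""
    fold_errors = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(mean((model(xs[i]) - ys[i]) ** 2 for i in test))
    return mean(fold_errors)


def auto_select(candidates, xs, ys):
    """Score every candidate and return (best_name, all_scores)."""
    scores = {name: cross_val_mse(fit, xs, ys) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


CANDIDATES = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}
best_name, scores = auto_select(CANDIDATES, X, Y)
print(best_name, scores)
```

Even this toy loop hints at where the real effort goes: data preparation, a richer search space, and deployment are all outside it, and those are the parts commercial platforms tend to package.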

It may come down to a question of time and cost. Is building such a solution going to take a lot of time and expense that may take away from the company’s primary business? Of course, this path must be weighed against the time and cost involved with evaluating, buying, and adopting a commercial solution.
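
One way to make that comparison concrete is to total the cost of each path over a planning horizon. The sketch below is a back-of-the-envelope model, not a pricing tool: the function names and every figure in it (salaries, license fee, evaluation and onboarding effort) are hypothetical placeholders to be replaced with your own estimates.

```python
def build_cost(engineers, months_to_build, monthly_salary,
               monthly_maintenance, horizon_months):
    """Total cost of building in-house over the planning horizon."""
    upfront = engineers * months_to_build * monthly_salary
    upkeep = monthly_maintenance * max(horizon_months - months_to_build, 0)
    return upfront + upkeep


def buy_cost(annual_license, evaluation_cost, onboarding_cost, horizon_months):
    """Total cost of evaluating, buying, and adopting a vendor platform."""
    return evaluation_cost + onboarding_cost + annual_license * horizon_months / 12


# Hypothetical figures, for illustration only.
HORIZON = 36  # planning horizon in months
build = build_cost(engineers=2, months_to_build=6, monthly_salary=12_000,
                   monthly_maintenance=4_000, horizon_months=HORIZON)
buy = buy_cost(annual_license=100_000, evaluation_cost=15_000,
               onboarding_cost=20_000, horizon_months=HORIZON)

cheaper = "build" if build < buy else "buy"
print(f"build: {build:,.0f}  buy: {buy:,.0f}  cheaper: {cheaper}")
```

The outcome depends entirely on the inputs, and a different team size or license fee can flip it. The model also leaves out softer factors such as the time diverted from the company’s primary business.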

Data science platform vendors offer a number of essential functions such as:

  • They connect to multiple data sources and provide traditional ETL functionality.
  • They allow you to run machine learning, deep learning, NLP, and other traditional models to some degree.
  • They can display model results or feed them into another system.
  • Some platforms are able to deploy to a production environment and have staging, testing, etc. built in.

 

All this sounds pretty good, but there are a few caveats to consider:

  • The issue of “vendor lock-in” is a real concern. What happens to customers if the company is acquired, or goes under?
  • Although many of the platforms offer APIs and open source connectivity, they are frameworks that are very good at what they do but generally inflexible otherwise.
  • Some vendors are slow to act on new trends, and you can be more nimble by building your own. For example, when deep learning became hot, before TensorFlow or MXNet, many of the existing platforms such as RapidMiner, H2O, and DataRobot had no deep learning capabilities, and some still do not.

 

For these reasons, data scientists may choose not to buy a commercial solution and instead build a custom one. Common motivations include:

  • Cost – the platform pricing may be prohibitive.
  • Flexibility – there is a desire to avoid vendor lock-in, or the framework is too narrowly focused. Using scikit-learn, TensorFlow, etc., it’s possible to build a model that is full-featured and better suited to the company’s workflow.
  • Domain expertise – many of these platforms are great for generic problems. However, if the need is to solve a very specific problem in a specific domain, it might be better to build a custom tool.
  • Existing infrastructure and expertise – if you’re running a Java shop or an R shop and have deployed a specific data warehouse/data lake, you may want to build out a solution to leverage your existing technology stack.

 

There is also a third, hybrid approach that many companies use: they pair a vendor platform with custom code to suit their specific use cases or problem domain.

 

Conclusion

In summary, the build-or-buy dilemma centers on capital and resources. When you decide to buy, you free up your existing team’s bandwidth, you don’t have to build out admin capabilities such as authentication and authorization, you get a stable and secure software platform, and the knowledge doesn’t leave when someone quits. On the other hand, when you decide to build, you get the excitement of creating your own platform, the platform will be much more attuned to your company’s workflow, and you can leverage open source tools with no license fees. Of course, you could opt to continue doing data science and machine learning the “old fashioned” way with tried and proven manual methods.

Editor’s note: Interested in learning more about possible data science platforms and solutions? Attend the ODSC East AI Solution Showcase Expo & Demo Hall in Boston this May 1-2 and see what’s available and how to use them!

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who has been working with data since long before the field came into vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator, having taught data science, machine learning, and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.