Automating Machine Learning: Just How Much?

Many businesses are interested in deploying machine learning, predictive analytics, or AI to gain the upper hand and make the right decisions. This can be a struggle, as finding the right technology and trustworthy experts is both expensive and complex. Even when the technology is provided, deploying and automating machine learning is time consuming: each iteration means trying different options and starting over, meeting after meeting.

One solution could be to automate the whole data science process (CRISP-DM) from reading the data to the final deployment. No need for expensive experts and their time. Just a single tool to train, test, and deploy models from a fancy UI. Such solutions are already available in some form and are accompanied by the inevitable debate.

On the one hand, supporters of fully automated data science cycles insist that such a tool should automatically access the data, clean them and prepare them, then train and test a preselected machine learning model, possibly optimize its hyperparameters, and finally deploy the best performing model.
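To make the fully automated cycle concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not how any particular tool works: the toy one-dimensional data, the k-nearest-neighbors model, the alternating train/validation split, and the function names (`clean`, `knn_predict`, `accuracy`, `automate`) are all hypothetical choices made for this example.

```python
from statistics import mean

def clean(rows):
    """Data cleaning: drop rows with a missing feature or label."""
    return [(x, y) for x, y in rows if x is not None and y is not None]

def knn_predict(train, x, k):
    """Predict the majority label (0 or 1) among the k nearest training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0

def accuracy(train, validation, k):
    """Share of validation rows that the k-NN model labels correctly."""
    return mean(1.0 if knn_predict(train, x, k) == y else 0.0
                for x, y in validation)

def automate(rows, k_grid=(1, 3, 5)):
    """Clean, split, tune the hyperparameter k, and return the chosen model.

    No human is consulted at any step. A real project would also keep a
    separate test set that is never used for tuning.
    """
    data = clean(rows)
    train, validation = data[::2], data[1::2]  # deterministic alternating split
    best_k = max(k_grid, key=lambda k: accuracy(train, validation, k))
    return best_k, (lambda x: knn_predict(train, x, best_k))

# Toy data (an assumption for illustration): small x -> class 0, large x -> class 1.
rows = [(0.0, 0), (0.1, 0), (0.2, 0), (0.3, 0), (None, 1),
        (0.7, 1), (0.8, 1), (0.9, 1), (1.0, 1)]
best_k, predict = automate(rows)
print(best_k, predict(0.05), predict(0.95))
```

Every decision here, from how missing values are handled to which model family is tried, is fixed in advance, which is exactly what the debate is about.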

On the other hand are the automation deniers, who insist that the data science process needs experimentation and manual expert care throughout all steps: data exploration; data cleaning and preparation; selection of the most suitable machine learning model from a number of different ML algorithms and architectures, maybe even after implementing, comparing, and combining some of them; optional hyperparameter optimization; simple testing and/or testing based on resampling techniques via specific error metrics; result investigation for insight discovery or possible mistakes in the process design; and final deployment of one or more of the trained models.

In this dichotomy, where do you stand?

In reality, some data science projects can benefit from full automation, while others need constant expertise and research to determine the best solution. However, most data science projects lie in between: a few steps can be comfortably automated, while others need expert intervention. It would be nice to be able to introduce a few strategically located interaction points throughout your whole data science process. We call this approach Guided Automation, as it automates most of the process but still allows for some interaction by the expert user.

An interaction point is a way for the expert to interact with the application and refine or change direction in the data science process. After all, we are not all experts in everything. Enabling the end user to inject their specific expertise at strategic points of the process can only benefit the final result.

The final data science application could run from a web browser; interaction points could be web pages where the application stops and waits for input from the expert user.
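The idea of interaction points can be sketched in a few lines of Python. This is a hypothetical illustration, not KNIME's implementation: the `Step` class, the `run_workflow` function, and the `ask` callback (which stands in for a web page that waits for the expert's input) are all names invented for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict, Any], dict]  # (state, choice) -> new state
    default: Any = None               # choice used when nobody intervenes
    interactive: bool = False         # True marks an interaction point

def run_workflow(steps, state, ask=None):
    """Run all steps in order; pause at interaction points if `ask` is given.

    `ask(name, default)` returns the expert's choice. With ask=None the
    workflow runs fully automated on the defaults.
    """
    for step in steps:
        choice = step.default
        if step.interactive and ask is not None:
            choice = ask(step.name, step.default)
        state = step.run(state, choice)
    return state

def load(state, _choice):
    """Automated step: read the data (hard-coded toy values here)."""
    return {**state, "rows": [1.0, 2.0, None, 4.0]}

def impute(state, strategy):
    """Interaction point: the expert may pick how missing values are filled."""
    known = [r for r in state["rows"] if r is not None]
    fill = 0.0 if strategy == "zero" else sum(known) / len(known)
    return {**state, "rows": [fill if r is None else r for r in state["rows"]]}

steps = [
    Step("load", load),
    Step("impute", impute, default="mean", interactive=True),
]

automatic = run_workflow(steps, {})                              # no expert involved
guided = run_workflow(steps, {}, ask=lambda name, default: "zero")  # expert overrides
print(automatic["rows"], guided["rows"])
```

The same workflow definition serves both modes, so how much interaction to allow becomes a matter of which steps are flagged as interactive rather than a rewrite of the process.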

In KNIME Analytics Platform, you can strategically place special nodes throughout your data processing and data science workflows. These special nodes generate web pages as interaction points when the workflow runs in a web browser. In this way, you can, for example, ask the expert user to perform more complex feature engineering, to select the machine learning models to train, or to choose the final execution platform if more than one is available. We call this approach of creating web pages as interaction points for the process Guided Analytics.

With Guided Analytics and Guided Automation, the whole data science process becomes more open, transparent and user-friendly. It is up to you to decide how much interaction is needed and what should be customized when automatically training models.

Do you want to learn how to build such a workflow? Attend our upcoming talk at ODSC East this April 30 to May 3, “Guided Analytics Learnathon: Building Applications for Automated Machine Learning.”


Presenters' Bios:

Paolo Tamagnini currently works as a data scientist at KNIME. Paolo holds a master's degree in data science and has research experience in data visualization techniques for machine learning interpretability.

Scott Fincher works for KNIME, Inc. as a data scientist. He has presented several talks on KNIME's open source Analytics Platform, and enjoys assisting other data scientists with optimizing and deploying their models. Prior to his work at KNIME, he worked for almost 20 years as an environmental consultant, with a focus on numerical modeling of atmospheric pollutants. He holds an MS in Statistics and a BS in Meteorology, both from Texas A&M University.


Paolo Tamagnini

Paolo Tamagnini contributed to this article. He is a data scientist at KNIME, holds a master’s degree in data science from Sapienza University of Rome and has research experience from NYU in data visualization techniques for machine learning interpretability. Follow Paolo on LinkedIn: https://www.linkedin.com/in/paolo-tamagnini/