A Deep Dive into H2O’s AutoML
The demand for machine learning systems has soared over the past few years. This is largely due to the success of Machine Learning techniques...



Automated Machine Learning: AutoML

Automated machine learning (AutoML) automates the end-to-end process of applying machine learning to real-world problems. It aims to automate as many steps of an ML pipeline as possible, with minimal human effort and without compromising the model’s performance.

Aspects of Automated Machine Learning
  • Automating certain parts of data preparation, e.g. imputation, standardization, feature selection, etc.
  • Being able to generate various models automatically, e.g. random grid search, Bayesian Hyperparameter Optimization, etc.
  • Getting the best model out of all the generated models, which most of the time is an Ensemble, e.g. ensemble selection, stacking, etc.
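To make the second point concrete, here is a minimal, library-free sketch of random grid search. The search space and the `toy_score` function are hypothetical stand-ins (a real run would train and cross-validate a model for each sampled combination), and this is not how H2O implements its search internally.

```python
import random

# Hypothetical search space; the names and values are illustrative only.
search_space = {
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
    "sample_rate": [0.6, 0.8, 1.0],
}

def toy_score(params):
    """Stand-in for a cross-validated metric; a real run would train a model here."""
    return 1.0 - abs(params["max_depth"] - 5) * 0.05 - abs(params["learn_rate"] - 0.05)

def random_grid_search(space, score_fn, n_trials, seed=1):
    """Sample n_trials random combinations and return them sorted best-first."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        trials.append((score_fn(params), params))
    # Sort on the score only, so the parameter dicts are never compared.
    trials.sort(key=lambda t: t[0], reverse=True)
    return trials

results = random_grid_search(search_space, toy_score, n_trials=10)
best_score, best_params = results[0]
print(best_score, best_params)
```

Unlike an exhaustive grid, the number of trials is fixed up front, so the cost does not grow with the size of the search space.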

H2O’s Automatic Machine Learning (AutoML)

Features of H2O
  • Core algorithms implemented in high-performance Java, with APIs in R, Python, and Scala, plus a web GUI.
  • Seamlessly works on Hadoop, Spark, AWS, your laptop, etc.

Who is it for?

H2O’s AutoML can be a helpful tool for novice and advanced users alike. It provides a simple wrapper function that performs a large number of modeling-related tasks that would typically require many lines of code. This frees up time to focus on other aspects of the data science pipeline, such as data preprocessing, feature engineering, and model deployment.

AutoML Interface

H2O AutoML has R and Python interfaces along with a web GUI called Flow. The interface is designed to have as few parameters as possible: all the user needs to do is point to their dataset, identify the response column, and optionally specify a time constraint or a limit on the total number of models trained.
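The two optional limits, a time constraint and a cap on the number of models, can be pictured with a small sketch in plain Python. The `run_with_budget` helper and the job list are hypothetical and only illustrate the stopping logic; they are not H2O's scheduler.

```python
import time

def run_with_budget(train_fns, max_models=None, max_runtime_secs=None):
    """Call training functions in order until a model-count or time limit is hit."""
    start = time.monotonic()
    trained = []
    for train_fn in train_fns:
        if max_models is not None and len(trained) >= max_models:
            break
        if max_runtime_secs is not None and time.monotonic() - start >= max_runtime_secs:
            break
        trained.append(train_fn())
    return trained

# Hypothetical "training" jobs that simply return a model name.
jobs = [lambda name=name: name for name in ["GLM_1", "GBM_1", "GBM_2", "DRF_1", "DNN_1"]]

models = run_with_budget(jobs, max_models=3, max_runtime_secs=60)
print(models)
```

Whichever limit is reached first ends the run; leaving both unset trains every job in the list.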


Installation

H2O offers an R package that can be installed from CRAN and a Python package that can be installed from PyPI. In this article, we shall be working with the Python implementation only. Also, you may want to look at the documentation for complete details.

pip install requests
pip install tabulate
pip install "colorama>=0.3.8"
pip install future
pip install h2o

H2O AutoML functionalities

H2O’s AutoML is equipped with the following functionalities:

  • Trains a random grid of algorithms such as GBMs, DNNs, and GLMs using a carefully chosen hyper-parameter space.
  • Individual models are tuned using cross-validation.
  • Two Stacked Ensembles are trained. One ensemble contains all the models (optimized for model performance), and the other contains just the best-performing model from each algorithm class/family (optimized for production use).
  • Returns a sorted “Leaderboard” of all models.
  • All models can be easily exported to production.
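The leaderboard and the "best of family" selection described in the list above can be sketched in a few lines of plain Python. The model ids, families, and AUC values below are made up for illustration, and the helper functions are hypothetical rather than part of H2O.

```python
# Hypothetical cross-validated results: (model_id, algorithm family, AUC).
results = [
    ("GBM_1", "GBM", 0.951),
    ("GBM_2", "GBM", 0.947),
    ("GLM_1", "GLM", 0.902),
    ("DRF_1", "DRF", 0.938),
    ("DNN_1", "DeepLearning", 0.921),
]

def leaderboard(rows):
    """Sort models from best to worst AUC."""
    return sorted(rows, key=lambda r: r[2], reverse=True)

def best_of_family(rows):
    """Keep only the top-scoring model of each algorithm family."""
    best = {}
    for row in leaderboard(rows):
        # Rows arrive best-first, so the first one seen per family is its best.
        best.setdefault(row[1], row)
    return list(best.values())

board = leaderboard(results)
family_best = best_of_family(results)
print([row[0] for row in board])
print([row[0] for row in family_best])
```

The smaller ensemble uses only the models in `family_best` as base learners, which keeps the number of models to score at prediction time low.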

Case Study

Predicting Material Backorders in Inventory Management using Machine Learning


Methodology

The basic outline for this machine learning problem is as follows.

import h2o
from h2o.automl import H2OAutoML

# Start a local H2O cluster (adjust max_mem_size to the memory available)
h2o.init(max_mem_size='16G')

data_path = "https://github.com/h2oai/h2o-tutorials/raw/master/h2o-world-2017/automl/data/product_backorders.csv"
# Load data into H2O
df = h2o.import_file(data_path)
df.head()
A sample of the dataset
print(f'Size of training set: {df.shape[0]} rows and {df.shape[1]} columns')
-------------------------------------------------------------
Size of training set: 19053 rows and 23 columns
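# Split the data into training (about 80%) and test (about 20%) frames
splits = df.split_frame(ratios=[0.8], seed=1)
train = splits[0]
test = splits[1]

# Response column, and predictors: every column except the response and the sku identifier
y = "went_on_backorder"
x = df.columns
x.remove(y)
x.remove("sku")

# Run AutoML for at most 120 seconds
aml = H2OAutoML(max_runtime_secs=120, seed=1)
aml.train(x=x, y=y, training_frame=train)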
  • max_models: Specify the maximum number of models to build in an AutoML run, excluding the Stacked Ensemble models. Defaults to NULL/None.

Leaderboard

Next, we can view the AutoML Leaderboard. The AutoML object includes a “leaderboard” of models that were trained in the process, including the 5-fold cross-validated model performance (by default).

lb = aml.leaderboard
lb.head()
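The leaderboard metrics come from 5-fold cross-validation by default. As a rough, library-free illustration of what that means, the sketch below splits row indices into five folds so that each row is used for validation exactly once. The `k_fold_indices` helper is hypothetical, and H2O assigns folds with its own logic.

```python
import random

def k_fold_indices(n_rows, k=5, seed=1):
    """Shuffle row indices and yield (train_idx, valid_idx) for each of k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid

splits = list(k_fold_indices(n_rows=20, k=5))
for train_idx, valid_idx in splits:
    print(len(train_idx), len(valid_idx))
```

A model is trained on each `train_idx` and scored on the matching `valid_idx`; the reported metric combines the five held-out scores.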
# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
# Get the "All Models" Stacked Ensemble model
se = h2o.get_model([mid for mid in model_ids if "StackedEnsemble_AllModels" in mid][0])
# Get the Stacked Ensemble metalearner model
metalearner = h2o.get_model(se.metalearner()['name'])
%matplotlib inline
metalearner.std_coef_plot()
Plotting the base learner contributions to the ensemble.
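To see why the metalearner's coefficients can be read as base learner contributions, consider this minimal sketch. It assumes a logistic (GLM-style) metalearner that takes each base model's predicted probability as an input; the coefficients, intercept, and model names are hypothetical and not taken from the run above.

```python
import math

def ensemble_probability(base_probs, coefs, intercept):
    """Combine base-model probabilities with a logistic (GLM-style) metalearner."""
    linear = intercept + sum(coefs[name] * p for name, p in base_probs.items())
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients; a zero coefficient means the base model is ignored.
coefs = {"GBM_1": 3.0, "DRF_1": 1.5, "GLM_1": 0.0}
intercept = -2.0

# Predicted probabilities from each base model for a single row.
row = {"GBM_1": 0.9, "DRF_1": 0.8, "GLM_1": 0.1}
p = ensemble_probability(row, coefs, intercept)
print(round(p, 3))
```

A base model with a larger coefficient moves the ensemble prediction more, while one with a zero coefficient has no effect on it.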
pred = aml.predict(test)
pred.head()
h2o.save_model(aml.leader, path="./product_backorders_model_bin")

Conclusion


Essentially, the purpose of AutoML is to automate repetitive tasks such as pipeline creation and hyperparameter tuning so that data scientists can spend more of their time on the business problem at hand. AutoML also aims to make the technology available to everybody rather than a select few. AutoML and data scientists can work together to accelerate the ML process so that the full potential of machine learning can be realized.


Parul Pandey


Parul is a Data Science Evangelist at H2O.ai. She combines data science, evangelism, and community in her work, with an emphasis on breaking down data science jargon for a general audience. Prior to H2O.ai, she worked with Tata Power India, applying machine learning and analytics to the pressing problem of load shedding in India. She is also an active writer and speaker and has contributed to various national and international publications, including TDS, Analytics Vidhya, KDnuggets, and DataCamp.
