Visualizing Machine Learning Datasets with Google’s FACETS
There has been a lot of uproar as to how a large quantity of training data can have a tremendous impact...


A Machine Learning dataset can contain anywhere from thousands to millions of data points, each of which may have hundreds or thousands of features. Additionally, real-world data is messy: it comes with missing values, unbalanced data, outliers and so on. It is therefore imperative to clean the data before proceeding with model building. Visualizing the data helps to locate these irregularities and points to where the data actually needs cleaning. Data visualization gives an overview of the entire dataset irrespective of its size and helps to perform exploratory data analysis (EDA) quickly and accurately.


The dictionary meaning of facets boils down to a particular aspect or feature of something. In the same way, the FACETS tool helps us to understand the various features of data and explore them without having to explicitly code.

Facets is an open-source visualization tool released by Google under the PAIR (People + AI Research) initiative. It helps us understand and analyze Machine Learning datasets. Facets consists of two visualizations, both of which help to drill down into the data and provide useful insights without much work on the user’s end.

  • Facets Overview

As the name suggests, this visualization gives an overview of the entire dataset and gives a sense of the shape of each feature of the data. Facets Overview summarizes statistics for each feature and compares the training and test datasets.

  • Facets Dive

This visualization helps the user dive deep into individual features and observations of the data to get more information. It supports interactive exploration of large numbers of data points at once.

These visualizations are implemented as Polymer web components, backed by TypeScript code, and can be easily embedded into Jupyter notebooks or web pages.

Usage & Installation

There are two ways in which FACETS can be used with data: directly in the browser through its demo page, or within a notebook as described next.

Within Jupyter Notebooks/Colaboratory

It is also possible to use FACETS within a Jupyter Notebook or Colaboratory. This gives more flexibility since the entire EDA and modeling can be done in a single notebook. Please refer to the project’s GitHub repository for complete details on installation. Later in the article, we will see how to get going with FACETS in Colab.


Although you can work with the data provided on the demo page, I shall be working with another set of data. I will be doing EDA with FACETS on the Loan Prediction Dataset. The problem statement is to predict whether an applicant who has been granted a loan by a company will repay it or not. It is a fairly well-known example in the ML community.

The dataset, which has already been divided into training and testing sets, can be accessed from here. Let’s load our data into Colab.

import pandas as pd
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Now let us understand how we can use Facets Overview with this data.

FACETS Overview

The Overview automatically gives a quick understanding of the distribution of values across the various features of the data. The distribution can also be compared across the training and testing datasets instantly. If an anomaly exists in the data, it stands out right away.

Some of the information that can be easily accessed through this feature:

  • Statistics like mean, median, and standard deviation
  • Min and max values of a column
  • Missing data
  • The percentage of values that are zero
  • Since it is possible to view the distributions across the test dataset as well, we can easily confirm whether the training and testing data follow the same distributions.

One could argue that these tasks can be achieved easily with pandas, so why invest in another tool? This is true, and another tool may not be required when we have few data points with a small number of features. However, the scenario changes when we are talking about a large dataset, where it becomes difficult to analyse each and every data point across multiple columns.
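For comparison, the pandas route mentioned above might look like the following minimal sketch. The tiny DataFrame and its column names are made up for illustration and are not the actual loan data, and the `summarize` helper is hypothetical.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary of numeric features: basic stats, missing and zero counts."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "std": numeric.std(),
        "min": numeric.min(),
        "max": numeric.max(),
        "missing": numeric.isna().sum(),
        "zeros": (numeric == 0).sum(),
    })


# Hypothetical stand-in data, not the real Loan Prediction dataset.
toy = pd.DataFrame({
    "income": [5000, 0, 4000, np.nan],
    "loan_amount": [120.0, 100.0, np.nan, np.nan],
})
summary = summarize(toy)
print(summary)
```

This works well for a handful of columns; with hundreds of features the resulting table gets long, which is where an interactive view becomes more convenient.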

Google Colaboratory makes this very easy since we do not need to install anything extra. A few lines of code get the work done.

# Clone the facets github repo to get access to the python feature stats generation code
!git clone https://github.com/pair-code/facets.git

To calculate the feature statistics, we need the GenericFeatureStatisticsGenerator class, which lives in a Python script inside the cloned repository.

# Add the path to the feature stats generation code.
import sys
sys.path.insert(0, '/content/facets/facets_overview/python/')

# Create the feature stats for the datasets and stringify it.
import base64
from generic_feature_statistics_generator import GenericFeatureStatisticsGenerator

gfsg = GenericFeatureStatisticsGenerator()
proto = gfsg.ProtoFromDataFrames([{'name': 'train', 'table': train},
                                  {'name': 'test', 'table': test}])
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")
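The base64 step is there because the serialized proto is raw bytes, which cannot be dropped directly into an HTML attribute or a JavaScript string. The sketch below shows the same encode/decode round trip on stand-in bytes; the `payload` value is made up and is not a real Facets proto.

```python
import base64

# Stand-in for proto.SerializeToString(); the real output is a binary protobuf.
payload = b"\x08\x01\x12\x05train\xff"

# Encode to ASCII-safe text so it can be embedded in an HTML template.
encoded = base64.b64encode(payload).decode("utf-8")

# Decoding recovers the exact original bytes.
decoded = base64.b64decode(encoded)
print(encoded, decoded == payload)
```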

Now with the following lines of code, we can easily display the visualization right in our notebook.

# Display the facets overview visualization for this data
from IPython.display import display, HTML

HTML_TEMPLATE = """<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/master/facets-dist/facets-jupyter.html" >
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))

As soon as you press Shift+Enter, you are welcomed by this nice interactive visualization:

Here, we see the Facets Overview visualization of the five numeric features of the Loan Prediction dataset. The features are sorted by non-uniformity, with the feature with the most non-uniform distribution at the top. Numbers in red indicate possible trouble spots, in this case, numeric features with a high percentage of values set to 0. The histograms at the right allow you to compare the distributions between the training data (blue) and test data (orange).

The above visualization shows one of the eight categorical features of the dataset. The features are sorted by distribution distance, with the feature with the biggest skew between the training (blue) and test (orange) datasets at the top.
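To get a feel for what a train/test skew on a categorical feature means numerically, here is a rough sketch. It uses the largest absolute difference in category shares as the distance, which is an assumption chosen for simplicity and not necessarily the measure Facets computes. The data and the `share_gap` helper are hypothetical.

```python
import pandas as pd


def share_gap(train_col: pd.Series, test_col: pd.Series) -> float:
    """Largest absolute difference in category proportions between two columns."""
    train_share = train_col.value_counts(normalize=True)
    test_share = test_col.value_counts(normalize=True)
    # Align on the union of categories; a category absent on one side counts as 0.
    diff = train_share.subtract(test_share, fill_value=0.0).abs()
    return float(diff.max())


# Hypothetical samples, not taken from the loan dataset.
toy_train = pd.Series(["Urban", "Urban", "Rural", "Semiurban"])
toy_test = pd.Series(["Urban", "Rural", "Rural", "Rural"])
gap = share_gap(toy_train, toy_test)
print(gap)
```

A value near 0 suggests the two splits have similar category shares for that feature, while a larger value flags a feature worth inspecting.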


Facets Dive provides an easy-to-customize, intuitive interface for exploring the relationship between the data points across the different features of a dataset. With Facets Dive, you control the position, color and visual representation of each data point based on its feature values. If the data points have images associated with them, the images can be used as visual representations.

To use the Dive visualization, the data has to be transformed into JSON format.
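As a small illustration of what that JSON looks like, the sketch below converts a two-row frame with `to_json(orient='records')`. The frame and its column names are hypothetical stand-ins for the training data.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical two-row frame; the column names are illustrative only.
toy = pd.DataFrame({"Gender": ["Male", "Female"], "LoanAmount": [128.0, np.nan]})

# Each row becomes one JSON object keyed by column name.
jsonstr = toy.to_json(orient="records")

# Parse it back to inspect the structure.
records = json.loads(jsonstr)
print(jsonstr)
```

Each row turns into one object in a JSON array, and missing values are written as `null`, so the string can be assigned directly to a JavaScript variable.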

# Display the Dive visualization for the training data.
from IPython.display import display, HTML

jsonstr = train.to_json(orient='records')
HTML_TEMPLATE = """<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/master/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
display(HTML(html))

After you run the code, you should be able to see this:

Facets Dive Visualisation

Now we can easily perform univariate and bivariate analysis. Let us see some of the results obtained:

Univariate Analysis

Here we will look at the target variable, i.e., Loan_Status, and other categorical features like gender, marital status, employment status and credit history, independently. Likewise, you can play around with other features.


  • Most of the applicants in the dataset are male.
  • Again a majority of the applicants in the dataset are married and have repaid their debts.
  • Also, most of the applicants have no dependents and are graduates from semi-urban areas.

Now let’s visualize the ordinal variables, i.e., Dependents, Education, and Property Area.

The following inferences can be made from the above bar plots:

  • Most of the applicants don’t have any dependents.
  • Most of the applicants are graduates.
  • Most of the applicants are from semi-urban areas.
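The same kind of univariate reading can be cross-checked in pandas with `value_counts(normalize=True)`. This is a minimal sketch on a made-up sample; the values below are hypothetical and only the column names echo the features discussed above.

```python
import pandas as pd

# Hypothetical sample; the real values come from train.csv.
toy = pd.DataFrame({
    "Dependents": ["0", "0", "1", "3+", "0"],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate", "Graduate"],
})

# Share of each category, computed per feature.
shares = {col: toy[col].value_counts(normalize=True) for col in toy.columns}
for col, share in shares.items():
    print(col, share.to_dict())
```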

Now you can continue your analysis with the numerical data.

Bivariate Analysis

We will find the relationship between the target variable and categorical independent variables.

It can be inferred from the above bar plots that:

  • The proportion of married applicants is higher for the approved loans.
  • Distribution of applicants with 1 or 3+ dependents is similar across both the categories of Loan_Status.
  • It seems people with credit history as 1 are more likely to get their loans approved.
  • The proportion of loans getting approved in the semiurban area is higher as compared to that in rural or urban areas.
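A bivariate comparison like the ones above can also be sketched in pandas with a row-normalized cross-tabulation. The sample is made up, and the `Credit_History` column name is an assumption about how the feature is named in the CSV.

```python
import pandas as pd

# Hypothetical sample; Credit_History is an assumed column name.
toy = pd.DataFrame({
    "Credit_History": [1, 1, 1, 0, 0, 1],
    "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
})

# Row-normalized table: for each Credit_History value, the share of each Loan_Status.
table = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")
print(table)
```

Normalizing by row makes the groups comparable even when one category has far more applicants than the other.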


FACETS provides an easy and intuitive environment to perform EDA for datasets and helps us derive meaningful results. The only catch is that currently it only works with Chrome.


Before ending this article, let us look at a fun fact: a small human labelling error in the CIFAR-10 dataset was caught using Facets Dive. While analyzing the dataset, it was noticed that an image of a frog had been incorrectly labelled as a cat. This is quite an achievement, since spotting such an error by eye would be a nearly impossible task.


Originally posted here. Reposted with permission.

Parul Pandey

Parul is a Data Science Evangelist at H2O.ai. She combines Data Science, evangelism and community in her work. Her emphasis is on breaking down data science jargon for people. Prior to H2O.ai, she worked with Tata Power India, applying Machine Learning and Analytics to solve the pressing problem of load shedding in India. She is also an active writer and speaker and has contributed to various national and international publications including TDS, Analytics Vidhya, KDNuggets and Datacamp.