Visualizing Decision Trees with Pybaobabdt

Data Visualization, Modeling | Posted by ODSC Community, September 2, 2022

Decision trees can be visualized in several ways. Take, for instance, the indentation diagram, where every internal and leaf node is rendered as text and the parent-child relationship is shown by indenting each child relative to its parent.

Indentation diagram | Image by Author
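To make the idea concrete, here is a minimal sketch of an indentation diagram in plain Python. The toy tree, its feature names, and its thresholds are made up for illustration and are not taken from the penguins data.

```python
# A hypothetical toy tree: each node is (label, [children]).
toy_tree = (
    "flipper_length <= 206",
    [
        ("bill_length <= 43", [("Adelie", []), ("Chinstrap", [])]),
        ("Gentoo", []),
    ],
)


def indented_lines(node, depth=0, indent="    "):
    """Return one text line per node, indenting children under their parent."""
    label, children = node
    lines = [indent * depth + label]
    for child in children:
        lines.extend(indented_lines(child, depth + 1, indent))
    return lines


print("\n".join(indented_lines(toy_tree)))
```

Each level of nesting adds one indent, so the depth of a node can be read directly from how far its line is shifted to the right.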

Then there is the node-link diagram, one of the most commonly used ways to visualize decision trees: nodes are drawn as glyphs, and parent and child nodes are connected by links.

Icicle plots are another option. In addition to displaying the parent-child relationship, they also convey the size of each node. They take their name from the fact that the resulting visualization looks like icicles.

An icicle plot by https://www.cs.middlebury.edu/~candrews/showcase/infovis_techniques_s16/icicle_plots/icicleplots.html | CC-BY license
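The layout behind an icicle plot can be sketched in a few lines. The snippet below is only an illustration under simple assumptions: every node is a hypothetical `(label, n_samples, children)` tuple, a node's width equals its sample count, and children are placed side by side one row below their parent.

```python
def icicle_layout(node, x=0.0, depth=0):
    """Return (label, depth, x_start, width) rectangles for an icicle plot."""
    label, n_samples, children = node
    rects = [(label, depth, x, float(n_samples))]
    child_x = x
    for child in children:
        rects.extend(icicle_layout(child, child_x, depth + 1))
        child_x += child[1]  # the next sibling starts where this one ends
    return rects


# Made-up sample counts, purely for illustration.
toy = ("root", 100, [("left", 60, [("A", 40, []), ("B", 20, [])]), ("right", 40, [])])

for label, depth, x_start, width in icicle_layout(toy):
    print(f"row {depth}: {label:<5} spans [{x_start}, {x_start + width}]")
```

Because the width of each rectangle is tied to the number of samples, the plot shows both the hierarchy and how much data flows through each node.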

While these techniques are helpful, they do not scale well as the data, and with it the tree, grows. In such situations, it becomes difficult not only to draw the tree but also to interpret and understand it. BaobabView is a visualization technique created to overcome such problems, and in this article, we’ll look at its Python implementation, pybaobabdt, in detail, along with examples.

We began by discussing the different ways of drawing a decision tree. Before turning to pybaobabdt, it is also worthwhile to look at some of the libraries that help plot decision trees.

Dataset

We’ll use the Palmer Penguins dataset as the common dataset here. It is a well-known dataset, often used as a drop-in replacement for the iris dataset, and the goal is to predict the penguin species from the given features.

First five rows of the dataset | Image by Author

1. Visualization using `sklearn.tree.plot_tree`

This is the most commonly used method, since it ships with scikit-learn itself.

Visualization using `sklearn.tree.plot_tree` | Image by Author

The `max_depth` of the tree has been limited to 3 for this example.

Visualization using `sklearn.tree.plot_tree` | Image by Author
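For readers who want to try this themselves, here is a minimal sketch of a `plot_tree` call. It uses scikit-learn's bundled iris data as a stand-in for the penguins data, and the variable names (`clf_iris`, `annotations`) are illustrative only.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()

# Limit the depth at training time, as in the example above.
clf_iris = DecisionTreeClassifier(max_depth=3, random_state=0)
clf_iris.fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(10, 6))
annotations = plot_tree(
    clf_iris,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    filled=True,  # color each node by its majority class
    ax=ax,
)
```

`plot_tree` draws onto a regular matplotlib Axes, so the result can be styled or saved like any other figure.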

2. Visualization using `dtreeviz`

The dtreeviz library renders better-looking, more intuitive visualizations and offers better interpretability options. It draws its inspiration from the educational animation by R2D3, A visual introduction to machine learning.

Visualization using `dtreeviz` | Image by Author

Visualization using `dtreeviz` | Image by Author

3. Visualization using TensorFlow Decision Forests (TF-DF)

TensorFlow Decision Forests is a library for training, serving, running inference with, and interpreting decision forest models. It provides a unified API for both tree-based models and neural networks, and it has built-in interactive plotting methods that help you plot and understand the tree structure.

Link to article: Reviewing the TensorFlow Decision Forests library

Visualization using TensorFlow Decision Forests (TF-DF) | Image by Author

Visualization using TensorFlow Decision Forests (TF-DF) | Image by Author

A paper titled "BaobabView: Interactive construction and analysis of decision trees" showcases a unique technique for visualizing decision trees. This technique is not only scalable but also enables experts to inject their domain knowledge into the construction of decision trees. The method is called BaobabView and relies on three critical aspects: visualization, interaction, and algorithmic support.

BaobabView’s three critical aspects of visualization, interaction, and algorithmic support | Image by Author

Here is an excerpt from the paper which highlights this point concretely:

We think our tool provides a double example of a visual analytics approach. We show how a machine learning method can be enhanced using interaction and visualization; we also show how manual construction and analysis can be supported by algorithmic and automated support.

What’s in the name?

Are you wondering about the strange name? Well, the term has its roots (pun intended) in Adansonia digitata, the African baobab, which the resulting visualization closely resembles.

Ferdinand Reus from Arnhem, Holland, CC BY-SA 2.0, via Wikimedia Commons

The pybaobabdt package is a Python implementation of BaobabView. Let’s now look more closely at this library, starting with its installation.

Installation

The package can be installed as follows:

`pip install pybaobabdt`

However, there are a few requirements that need to be fulfilled:

• Python version ≥ 3.6
• PyGraphviz
• Popular Python packages such as scikit-learn, numpy, matplotlib, scipy, and pandas should also be installed.

Pybaobabdt in action

We’ll continue with our penguins’ dataset and build a decision tree to predict the penguin species from the given features.

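```
import pybaobabdt
from sklearn.tree import DecisionTreeClassifier

# df is assumed to be the penguins DataFrame, already cleaned and numerically encoded
y = list(df['Species'])       # target labels
features = list(df.columns)
features.remove('Species')    # every remaining column is a feature
X = df.loc[:, features]

clf = DecisionTreeClassifier().fit(X, y)
```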

The code above initializes and trains a classification tree. Once that is done, the next task is to visualize the tree using the `pybaobabdt` package, which can be accomplished in just a single line of code.

`ax = pybaobabdt.drawTree(clf, size=10, dpi=300, features=features, ratio=0.8, colormap='Set1')`

Visualizing decision tree classifier using Pybaobabdt package | Image by Author

There you go! You have a decision tree classifier, where every class of species is represented with a different color. In the case of a Random Forest, it is also possible to visualize individual trees. These trees can then be saved to higher resolution images for in-depth inspection.
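As a rough sketch of where those individual trees come from: a fitted scikit-learn `RandomForestClassifier` exposes its trees through the `estimators_` attribute. The example below uses the bundled iris data as a stand-in, and the names `forest` and `first_tree` are illustrative. Drawing one of these trees with pybaobabdt is not shown here; check the package documentation for the exact arguments it expects when a tree belongs to a forest.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_iris, y_iris = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_iris, y_iris)

# estimators_ holds the fitted decision trees, one per n_estimators.
for idx, tree in enumerate(forest.estimators_):
    print(f"tree {idx}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")

# A single tree that can be inspected or plotted on its own.
first_tree = forest.estimators_[0]
```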

The pybaobabdt library also offers a bunch of customizations. I’ll showcase a few of them here:

Colormaps

pybaobabdt supports all matplotlib colormaps. We have seen what a `Set1` colormap looks like, but you can choose from many different options. Here is how a few of them appear when used:

Decision tree visualization with Pybaobabdt with different colormaps | Image by Author

But you are not limited to the available colormaps. You can even define one of your own. Let’s say we want to highlight just one specific class in our dataset while keeping all the others in the background. Here’s what we can do:

```
from matplotlib.colors import ListedColormap

# One color per class: one class in green, the others in gray
colors = ["green", "gray", "gray"]
colorMap = ListedColormap(colors)
ax = pybaobabdt.drawTree(clf, size=10, features=features, ratio=0.8, colormap=colorMap)
```

Highlighting only a specific class in the decision tree | Image by Author
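If you want to pick the highlighted class by name rather than by position, a small helper can build the color list for you. This is only a sketch: `highlight_colormap` is a hypothetical helper, the class labels are placeholders, and it assumes colors are assigned in the same order as the classifier's classes (for example `list(clf.classes_)`).

```python
from matplotlib.colors import ListedColormap


def highlight_colormap(class_names, highlight, on="green", off="gray"):
    """Build a colormap with one color per class, emphasizing a single class."""
    if highlight not in class_names:
        raise ValueError(f"{highlight!r} is not one of {list(class_names)}")
    return ListedColormap([on if name == highlight else off for name in class_names])


# Placeholder labels; in practice use the labels of your fitted classifier.
classes = ["Adelie", "Chinstrap", "Gentoo"]
cmap = highlight_colormap(classes, "Chinstrap")
print(cmap.colors)
```

The resulting `cmap` can be passed wherever a matplotlib colormap is accepted, including the `colormap` argument used above.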

Ratio

The `ratio` option sets the ratio of the rendered tree, with a default value of 1. Here’s a comparison of two ratios and how they appear on the screen.

`ax = pybaobabdt.drawTree(clf, size=10, dpi=300, features=features, ratio=0.5, colormap='viridis')`

How different ratios affect figure size | Image by Author

Maximum depth

The `maxdepth` parameter controls how many levels of the tree are drawn. A lower number hides the deeper splits and shows only the top ones. If the `maxdepth` of the above tree is set to 3, we’ll get a stunted tree:

`ax = pybaobabdt.drawTree(clf, size=10, maxdepth=3, features=features, ratio=1, colormap='plasma')`

Adjusting the maximum depth of the tree to control the tree size | Image by Author
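It is worth keeping the two depth settings apart. Since `drawTree` receives an already fitted classifier, its depth option can only limit what is drawn, while scikit-learn's `max_depth` changes the model itself. The sketch below illustrates the training-time side with the bundled iris data as a stand-in; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Same data, same seed: only the training-time depth limit differs.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

print("full tree:    depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("shallow tree: depth", shallow_tree.get_depth(), "leaves", shallow_tree.get_n_leaves())
```

Limiting depth at training time yields a different, smaller model, whereas limiting it at drawing time leaves the model and its predictions untouched.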

The output graph can be saved as follows:

`ax.get_figure().savefig('classifier_tree.png', format='png', dpi=300, transparent=True)`

Conclusion

The pybaobabdt package offers a fresh perspective on visualizing decision trees. It includes features not found in its counterparts, and its main idea is to help users understand and interpret the tree through meaningful visualizations. This article used a straightforward example to demonstrate the library. To see its real strength, try it on much larger and more complex datasets; I’ll leave that as an exercise for the readers.

Article originally posted here. Reposted with permission.
