In their paper, Tabular Data: Deep Learning is Not All You Need, the authors argue that while deep learning methods have shown tremendous success in the image and text domains, traditional tree-based methods like XGBoost continue to shine when it comes to tabular data. The authors examined four deep learning models, TabNet, Neural Oblivious Decision Ensembles (NODE), DNF-Net, and a 1D-CNN, and compared their performance against XGBoost on eleven datasets.
Tabular Data: Deep Learning is Not All You Need | source: https://arxiv.org/pdf/2106.03253.pdf
This is an important paper in the sense that it reiterates the fact that deep learning may not be the silver bullet for solving all machine learning problems. On the other hand, tree-based algorithms have been shown to perform on par with or even outperform neural networks on tabular data while being simple to use and comprehend.
And there is good news for people who like to work with tree-based models. A few months back, the TensorFlow Decision Forests, aka TF-DF library, was open-sourced by Google. In this article, we’ll understand what TF-DF is and how it could be helpful for us.
Many great resources and code examples are available as part of the documentation (refer to the References section below), so I will not reinvent the wheel. This article is not a getting-started guide but rather a quick overview of the library that showcases its main ideas and features. For a deeper dive, the article by Eryk Lewinson on TensorFlow Decision Forests using the Pokemon datasets is recommended.
TensorFlow Decision Forests (TF-DF)
Decision Forests (DF) are a class of machine learning algorithms built from multiple decision trees. Random Forests and Gradient Boosted Decision Trees are the two most popular DF training algorithms. TensorFlow Decision Forests is a library for training, serving, inference, and interpretation of these Decision Forest models. TF-DF is essentially a wrapper around the C++ Yggdrasil Decision Forests (YDF) library, making it available in TensorFlow.
TF-DF provides a unified API for tree-based models as well as neural networks, which is incredibly convenient: the same workflow now covers both model families.
The TF-DF library can be installed easily with pip. However, it is not yet compatible with macOS or Windows; for non-Linux users, using it via Google Colab could be a workaround.
Let’s look at a basic example of using TF-DF on the Palmer Penguins dataset. This dataset is a popular drop-in replacement for the Iris dataset, and the goal is to predict a penguin’s species from the given features.
First five rows of the dataset
As you can see, the dataset is a mix of numerical and categorical features and is a classic example of a classification machine learning problem. Training a decision forest in TensorFlow is very intuitive, as can be seen in the example below. The code has been taken from the official documentation.
Training a Tensorflow Decision Forest | Image by Author
A lot of things stand out. Notably, no preprocessing such as one-hot encoding or normalization is required. We’ll touch on these points in the next section.
TF-Decision Forests stands out on several fronts. Let’s briefly discuss a few of them:
Highlights of TF Decision Forests | Image by Author
Ease of Use
- The same Keras API can be used for neural networks as well as tree-based algorithms. It is also possible to combine decision forests and neural networks to create new types of hybrid models.
- No need to specify input features. TensorFlow Decision Forests can automatically detect the input features from the dataset.
Automatic detection of input features by TF Decision Forests | Image by Author
- No preprocessing such as categorical encoding, normalization, or missing-value imputation is required.
- No validation dataset is required. If provided, the validation dataset will only be used for displaying metrics.
Easy deployment options with TensorFlow Serving
- After the model is trained, you can evaluate it on a test dataset using model.evaluate() or make predictions with model.predict(). Finally, you can save the model in the SavedModel format to be served just like any other TensorFlow model using TensorFlow Serving.
Serving via TensorFlow Serving | Image by Author
- Sometimes it is imperative to understand how a model works under the hood, especially for high-stakes decisions. TensorFlow Decision Forests has built-in plotting methods to plot and help understand the tree structure.
Here is the plot of the first tree of our Random Forest model.
tfdf.model_plotter.plot_model_in_colab(model_1, tree_idx=0, max_depth=3)
Interactive visualization | Image by Author
Additionally, one can access the model structure and feature importances along with the training logs.
Scope for Improvement
Many useful features come packaged with TF Decision Forests, but there are also some areas for improvement.
- No direct support for Windows or macOS (to date).
- As of now, only three algorithms are available in the TF-DF module: Random Forests, Gradient Boosted Trees, and CART.
All available learning algorithms in TF Decision Forests library | Image by Author
- Currently, there is also no support for running the models on GPU/TPU infrastructure.
Final word & Resources to get started
Overall, TF Decision Forests provides an excellent option for building tree-based models with TensorFlow and Keras. It is especially convenient for those who already have a TensorFlow pipeline in place. The library is under constant development, so many more features can be expected soon. If you want to look at code examples and use the library for your own use cases, there are some excellent resources for all levels.
- TensorFlow Decision Forests tutorials
- TensorFlow Decision Forests project on GitHub.
- Official Blogpost
Article originally posted here. Reposted with permission.