Modeling Regression Trees

Posted by Diego Lopez Yse, July 7, 2020

Decision Trees (DTs) are probably one of the most popular Machine Learning algorithms. In my post “The Complete Guide to Decision Trees”, I describe DTs in detail: their real-life applications, different DT types and algorithms, and their pros and cons. I’ve detailed how to program Classification Trees, and now it’s the turn of Regression Trees.

Regression Trees work with numeric target variables. Unlike Classification Trees in which the target variable is qualitative, Regression Trees are used to predict continuous output variables. If you want to predict things like the probability of success of medical treatment, the future price of a financial stock, or salaries in a given population, you can use this algorithm. Let’s see an implementation example in Python.

The Problem

The Boston Housing dataset contains house prices for various areas of Boston, USA. Along with the price, the dataset provides information such as the crime level, the proportion of non-retail business land in the town, the age of the houses, and other attributes.

The variable called ‘MEDV’ indicates the prices of the houses and is the target variable. The rest of the variables are the predictors based on which we will predict the value of the house.

The Steps

You can cut down the complexity of building DTs by dealing with simpler sub-steps: each individual sub-routine in a DT will connect to other ones to increase complexity, and this construction will let you reach more robust models that are easier to maintain and improve. Now, let’s build a Regression Tree (a special type of DT) in Python.

Load data and describe the dataset

Loading a data file is the easy part. The hard (and most time-consuming) part is usually the data preparation process: setting the right data formats, dealing with missing values and outliers, eliminating duplicates, etc.

Before loading the data, we’ll import the necessary libraries:

```
import pandas as pd
import numpy as np
from sklearn import datasets  # needed for datasets.load_boston() below
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import r2_score
```

Now we load the dataset and convert it to a Pandas Dataframe:

```
# Note: load_boston was removed in scikit-learn 1.2, so this call needs an older version.
boston = datasets.load_boston()
df = pd.DataFrame(boston.data)
```

And name the columns:

```
df.columns = boston.feature_names
df['MEDV'] = boston.target
```

First, understand the dataset and describe it:

`print(boston.DESCR); df.info()`

Nice: 506 records, 14 numeric variables, and no missing values. We don’t need to preprocess the data, so we’re ready to model.

Select features and the target variable

You need to divide your columns into two types of variables: the dependent variable (or target variable) and the independent variables (or feature variables). In our example, the variable “MEDV” (the median value of owner-occupied homes) is the one we’re trying to predict.

```
X = df.iloc[:,0:13].copy()   # the 13 predictor columns
y = df.iloc[:,13].copy()     # the target column, MEDV
```

Split the dataset

To understand model performance, dividing the dataset into a training set and a test set is a good strategy. By splitting the dataset into two separate sets, we can train using one set and test using another.

• Training set: this data is used to build your model. E.g. using the CART algorithm to create a Decision Tree.
• Testing set: this data is used to see how the model performs on unseen data, as it would in a real-world situation. This data should be left completely unseen until you would like to test your model to evaluate performance.

Next, we split our dataset into a 70% training set and a 30% test set.

`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)`
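The split above is random, so the metrics reported later will vary a little from run to run. A minimal sketch of a reproducible split follows, assuming synthetic stand-in data: the arrays `X_demo` and `y_demo` and the seed value 42 are illustrative and not part of the original example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features (not the Boston dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Fixing random_state makes the 70/30 split identical on every run.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))  # same seed, same test rows
```

Any fixed integer works as the seed; what matters is that it stays the same when you want to compare models on the same split.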

Build DT model and finetune

Building a DT is as simple as this:

`rt = DecisionTreeRegressor(criterion='mse', max_depth=5)`

In this case, we only defined the splitting criterion (mean squared error) and one hyperparameter (the maximum depth to which the tree will be built). Note that in scikit-learn 1.0 and later this criterion is named 'squared_error', and 'mse' was removed in version 1.2. Parameters that define the model architecture are referred to as hyperparameters, and the process of searching for the ideal model architecture (the one that maximizes the model performance) is referred to as hyperparameter tuning. A hyperparameter is a parameter whose value is set before the learning process begins, so it can’t be directly learned from the data.

You can take a look at the rest of the hyperparameters you can tune by calling the model:

`rt`

Models can have many hyperparameters and there are different strategies for finding the best combination of parameters. You can take a look at some of them on this link.
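One common strategy is an exhaustive grid search with cross-validation. The sketch below is a minimal illustration under stated assumptions: the data is synthetic, and the names `X_demo`, `y_demo`, `param_grid`, and the candidate values are hypothetical choices rather than recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in data with a non-linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.5, size=300)

# Candidate values are illustrative; choose ranges that suit your data.
param_grid = {"max_depth": [2, 3, 5, 8], "min_samples_leaf": [1, 5, 10]}

# Every combination is scored with 5-fold cross-validation on R squared.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Grid search tries every combination, so its cost grows quickly with the number of hyperparameters; a randomized search is a frequent alternative when the grid gets large.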

Train DT model

Fitting your model to the training data is the training part of the modeling process, and it is done with a fit method call. Once trained, the model can be used to make predictions with a predict method call:

`model_r = rt.fit(X_train, y_train)`

Test DT model

A test dataset is a dataset that is independent of the training dataset. It is unseen data for your model, and it helps you assess how well the model generalizes:

`y_pred = model_r.predict(X_test)`

Visualize

One of the biggest strengths of DTs is their interpretability. Visualizing DTs is not only a powerful way to understand your model, but also to communicate how your model works:

```
from sklearn import tree
import graphviz
# class_names only applies to classification trees, so it is omitted for a regressor
dot_data = tree.export_graphviz(rt, feature_names=list(X.columns), filled=True)
graphviz.Source(dot_data)
```

The variable “LSTAT” seems to be critical to define the partition of the Regression Tree. We’ll check this later once we calculate feature importances.

Evaluate Performance

The quality of a model is related to how well its predictions match up against actual values. Evaluating your machine learning algorithm is an essential part of any project: how can you measure its success and when do you know that it shouldn’t be improved any more? Different machine learning algorithms have varying evaluation metrics, so let’s mention some of the main ones for regression problems:

Mean absolute error (MAE)

Is the mean of the absolute values of the individual prediction errors over all instances in the test set. It tells us how big of an error we can expect on average.

`print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))`

Mean squared error (MSE)

Is the mean of the squared prediction errors over all instances in the test set. Because the errors are squared, the MSE is expressed in squared units of the original output, so we can’t directly compare the MAE to the MSE. Squaring also means that any error larger than 1 counts for more in the MSE than it does in the MAE.

`print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))`

The effect of the square term in the MSE equation is most apparent with the presence of outliers in our data: while each residual in MAE contributes proportionally to the total error, the error grows quadratically in MSE. This ultimately means that outliers in our data will contribute to much higher total error in the MSE than they would in the MAE, and the model will be penalized more for making predictions that differ greatly from the corresponding actual value.
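To make the outlier effect concrete, here is a small hand-computed sketch. The helper functions `mae` and `mse` and the toy arrays are hypothetical illustrations, not part of the original example; they follow the definitions given above.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of the absolute prediction errors."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_close = np.array([11.0, 11.0, 15.0, 15.0])    # every error has size 1
y_outlier = np.array([10.0, 12.0, 14.0, 26.0])  # a single error of size 10

print(mae(y_true, y_close), mse(y_true, y_close))      # 1.0 1.0
print(mae(y_true, y_outlier), mse(y_true, y_outlier))  # 2.5 25.0
```

The single large miss raises the MAE from 1.0 to 2.5, but it raises the MSE from 1.0 to 25.0.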

Root mean squared error (RMSE)

Is the square root of the mean of the squared errors. By squaring the errors before we calculate their mean and then taking the square root of the mean, we arrive at a measure of the size of the error that gives more weight to large but infrequent errors than the MAE does, while staying in the same units as the target. We can also compare RMSE and MAE to determine whether the forecast contains large but infrequent errors: the larger the difference between RMSE and MAE, the more inconsistent the error size.

`print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))`
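The RMSE versus MAE comparison can be checked on two made-up error vectors that share the same MAE. This is only a sketch: the functions `mae_of` and `rmse_of` and the arrays `steady` and `spiky` are hypothetical names chosen for illustration.

```python
import numpy as np

def mae_of(errors):
    """Mean absolute error from a vector of residuals."""
    return np.mean(np.abs(errors))

def rmse_of(errors):
    """Root mean squared error from a vector of residuals."""
    return np.sqrt(np.mean(errors ** 2))

# Two hypothetical residual vectors with the same MAE of 2.0.
steady = np.array([2.0, -2.0, 2.0, -2.0])  # consistent error size
spiky = np.array([0.0, 0.0, 0.0, 8.0])     # one large, infrequent error

print(mae_of(steady), rmse_of(steady))  # 2.0 2.0
print(mae_of(spiky), rmse_of(spiky))    # 2.0 4.0
```

When all errors have the same size, RMSE equals MAE; the gap opens up only when the error sizes are uneven.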

R Squared score (R2)

Explains in percentage terms the amount of variation in the response variable that is due to variation in the feature variables. R Squared is at most 1, and on test data it can even be negative when the model predicts worse than simply using the mean of the target. Although it provides some useful insights regarding the regression model, you shouldn’t rely only on this measure for the assessment of your model.

`print('R Squared Score is:', r2_score(y_test, y_pred))`

The most common interpretation of R Squared is how well the regression model fits the observed data. In our example, an R Squared of 0.74 means that the model explains about 74% of the variance in the target variable. Although a higher R Squared generally indicates a better fit, a high value is not always a sign of a good regression model: the quality of the statistical measure depends on many factors, such as the nature of the variables employed in the model, the units of measure of the variables, and the applied data transformation.
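The usual definition of R Squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on toy values; the function name `r_squared` and the arrays are hypothetical, and the formula is the same one `r2_score` documents.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

print(r_squared(y_true, np.array([1.0, 2.0, 3.0, 4.0])))  # 1.0, perfect predictions
print(r_squared(y_true, np.full(4, 2.5)))                 # 0.0, always predicting the mean
print(r_squared(y_true, np.array([4.0, 3.0, 2.0, 1.0])))  # -3.0, worse than the mean
```

The last case shows why a score below zero is possible: the predictions are further from the truth than a constant guess at the mean would be.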

Feature importance

Another key metric consists of assigning scores to input features of a predictive model, indicating the relative importance of each feature when making a prediction. Feature importance provides insights into the data, the model, and represents the basis for dimensionality reduction and feature selection, which can improve the performance of a predictive model. The more an attribute is used to make key decisions with the DT, the higher its relative importance.

```
# Print features from most to least important
for importance, name in sorted(zip(rt.feature_importances_, X_train.columns), reverse=True):
    print(name, importance)
```

As highlighted in the visualization, the variable “LSTAT” has a higher importance in relation to other variables (being the main feature of the model). Let’s see that on a plot:

Features “LSTAT” and “RM” account for more than 80% of the importance for making predictions.
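A cumulative view makes this kind of statement easy to check. The following is a minimal sketch on synthetic stand-in data, not on the Boston dataset: the column names `a`, `b`, `c` and the variables `X_demo`, `y_demo`, and `tree_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in data: the target depends strongly on "a",
# weakly on "b", and not at all on "c".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
y_demo = 5.0 * X_demo["a"] + 1.0 * X_demo["b"] + rng.normal(scale=0.1, size=500)

tree_demo = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are normalized, so they sum to 1.
importances = pd.Series(tree_demo.feature_importances_, index=X_demo.columns)
ranked = importances.sort_values(ascending=False)

print(ranked)
print(ranked.cumsum())  # running share of the total importance
```

Reading the cumulative column from the top tells you how many features are needed to reach a given share of the total importance.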

Error metrics are most meaningful when compared against those of a competing model (e.g. the R Squared scores of two different models), and although these measures provide valuable insights regarding the model’s performance, always remember:

Just because a forecast has been accurate in the past, it doesn’t mean it will be accurate in the future.

Final thoughts

We’ve covered several steps during our modeling, and each one of them is a discipline on its own: exploratory data analysis, feature engineering, or hyperparameter tuning are all extensive and complex aspects of any machine learning model. You should consider going deeper into those subjects.

One important aspect to look at regarding Decision Trees is the way they partition the data space in comparison to other algorithms. If you had chosen to solve the Boston housing price prediction with linear regression, you would have seen a graph like the following:

A linear regression will search for the linear relationship between the target and its predictor. In this example, both variables (“MEDV” and “RM”) seem linearly related which is why this method may work relatively fine, but reality often shows non-linear relationships. Let’s see how a Regression Tree would map the same relationship between target and predictor:

In this example, a Regression Tree that uses MSE as its partition criterion and a max_depth of 5 divides the data space in a completely different way, identifying relationships that a linear regression can’t fit.
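The contrast is easiest to see on data where the relationship is clearly non-linear. Here is a minimal sketch, assuming a synthetic U-shaped relationship rather than the Boston data; all variable names are hypothetical, and the tree uses the library's default splitting criterion.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in data: a U-shaped relationship a straight line cannot follow.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.3, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

linear = LinearRegression().fit(X_tr, y_tr)
tree_demo = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)

linear_r2 = r2_score(y_te, linear.predict(X_te))
tree_r2 = r2_score(y_te, tree_demo.predict(X_te))
print(f"linear R2: {linear_r2:.3f}")
print(f"tree R2:   {tree_r2:.3f}")
```

The tree approximates the curve with a staircase of constant predictions, one per leaf, while the line can only capture an overall slope that is close to zero here.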

The way a Decision Tree partitions the data space to optimize a given criterion will depend not only on the criterion itself (e.g. MSE or MAE as the partition criterion), but on the setup of all hyperparameters. Hyperparameter optimization defines the way a Decision Tree works, and ultimately its performance. Some hyperparameters will deeply affect the performance of the model, and finding their right levels is critical to reaching the best possible performance. In the example below, you can see how the hyperparameter max_depth has a huge influence on the Regression Tree’s R squared score for values up to 10, while above 10 the level you choose has no further impact on it:
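A sweep over max_depth is a simple way to explore this kind of behavior on your own data. The sketch below uses synthetic stand-in data, so the depth at which the scores flatten will differ from the Boston example; the variable names and the range of depths are hypothetical choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in data with two non-linear effects plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 2))
y_demo = np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(scale=0.3, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

scores = {}
for depth in range(1, 16):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # For regressors, score() returns R squared.
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for depth, (train_r2, test_r2) in scores.items():
    print(f"max_depth={depth:2d}  train R2={train_r2:.3f}  test R2={test_r2:.3f}")
```

Comparing the training and test columns also hints at overfitting: a training score that keeps climbing while the test score stalls or drops suggests the extra depth is fitting noise.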

In order to overcome the fact that you may overfit your model by trying to find the “perfect” hyperparameter levels for your DT, you should consider exploring ensemble methods. Ensemble methods combine several DTs to produce better predictive performance than single DTs. The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner, significantly improving the performance of a single DT. They are used to decrease the model’s variance and bias and improve predictions. Now that you saw how a Decision Tree works, I suggest you move forward with ensemble methods like Bagging or Boosting.
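As a starting point, the sketch below fits a single tree next to two ensembles on the same synthetic data. It is an illustration under stated assumptions, not a benchmark: the data is made up, the model settings are mostly library defaults, and a random forest stands in for Bagging while gradient boosting stands in for Boosting.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in data with two non-linear effects plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

models = {
    "single tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random forest (bagging-style)": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model on the training set and record its test-set R squared.
results = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}

for name, test_r2 in results.items():
    print(f"{name}: test R2 = {test_r2:.3f}")
```

Ensembles often score higher than a single tree on noisy data like this, though that is not guaranteed, and they give up some of the interpretability that makes a single tree attractive.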

Originally posted here.

Diego Lopez Yse

Reshaping with technology. https://www.linkedin.com/in/lopezyse/
