

Validating Type I and II Errors in A/B Tests in R
R · Tools & Languages · posted by John Mount, September 5, 2019

In our previous article, we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.
However, when making predictions on individuals, a biased model may be preferable: it may be more accurate, or make predictions with lower relative error, than an unbiased one. For example, tree-based ensemble models tend to be highly accurate, and are often the modeling approach of choice for many machine learning applications. In this note, we will show that tree-based models are biased, or uncalibrated. In other words, unbiasedness is not always the best bias/variance trade-off.
Example: Predicting income
We’ll continue the example from the previous post: predicting income from demographic variables (sex, age, employment, education). The data is from the 2016 US Census American Community Survey (ACS) Public Use Microdata Sample (PUMS); more information about the data can be found here. First, we’ll get the training and test data, and show how the expected income varies along different groupings (by sex, by employment, and by education):
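A minimal sketch of that step, assuming the prepared PUMS sample is loaded as a data frame named incomedata with columns income, age, sex, employment, and education (these names, the gp column, and the random split below are illustrative placeholders, not the original code):

```r
library(dplyr)

# Assumed: incomedata is the prepared PUMS sample, with columns
# income, age, sex, employment, and education (names are placeholders).
set.seed(2019)                                  # for a reproducible split
incomedata$gp <- ifelse(runif(nrow(incomedata)) < 0.5, "train", "test")
train <- subset(incomedata, gp == "train")
test  <- subset(incomedata, gp == "test")

# Rollup: mean income by a grouping variable
rollup <- function(data, groupvar) {
  data %>%
    group_by(.data[[groupvar]]) %>%
    summarize(mean_income = mean(income), .groups = "drop")
}

rollup(train, "sex")
rollup(train, "employment")
rollup(train, "education")
```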
Three models
We’ll fit three models to the data: two tree ensemble models (random forest and gradient boosting), and one quasipoisson model, a calibrated model, for comparison.
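A sketch of the model fitting, here using the ranger and gbm packages; the package choices and hyperparameters are assumptions for illustration, and the original analysis may have used different ones:

```r
library(ranger)   # random forest
library(gbm)      # gradient boosting

# Random forest (ranger accepts factor inputs directly)
model_rf <- ranger(income ~ age + sex + employment + education,
                   data = train, num.trees = 500)

# Gradient boosting machine (hyperparameters are illustrative)
model_gbm <- gbm(income ~ age + sex + employment + education,
                 data = train, distribution = "gaussian",
                 n.trees = 500, interaction.depth = 3, shrinkage = 0.05)

# Quasipoisson regression: the calibrated model for comparison
model_glm <- glm(income ~ age + sex + employment + education,
                 data = train, family = quasipoisson(link = "log"))
```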
First, we’ll compare the rollups of the predictions to the actual rollups on the training data.
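Continuing the sketch above (with the same assumed names), the comparison looks something like this:

```r
# Attach training-set predictions from each model
train$pred_rf  <- predict(model_rf, data = train)$predictions
train$pred_gbm <- predict(model_gbm, newdata = train, n.trees = 500)
train$pred_glm <- predict(model_glm, newdata = train, type = "response")

# Predicted vs. actual rollups, e.g. with respect to employment
train %>%
  group_by(employment) %>%
  summarize(actual = mean(income),
            rf     = mean(pred_rf),
            gbm    = mean(pred_gbm),
            glm    = mean(pred_glm),
            .groups = "drop")
```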
Note that the rollups of the predictions from the two ensemble models don’t match the true rollups, even on the training data; unlike the Poisson model, the random forest and gradient boosting models are uncalibrated.
Model performance on holdout data
Let’s see the performance of the models on test data.
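A sketch of the holdout evaluation, under the same assumptions as the earlier code:

```r
# Root mean squared error of individual predictions
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

test$pred_rf  <- predict(model_rf, data = test)$predictions
test$pred_gbm <- predict(model_gbm, newdata = test, n.trees = 500)
test$pred_glm <- predict(model_glm, newdata = test, type = "response")

c(rf  = rmse(test$income, test$pred_rf),
  gbm = rmse(test$income, test$pred_gbm),
  glm = rmse(test$income, test$pred_glm))

# Rollups with respect to employment on the holdout data
test %>%
  group_by(employment) %>%
  summarize(actual = mean(income),
            rf     = mean(pred_rf),
            gbm    = mean(pred_gbm),
            glm    = mean(pred_glm),
            .groups = "drop")
```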
The calibrated Poisson model gives better estimates of the income rollups with respect to employment than either of the ensemble models, despite the fact that all the models have similar root mean squared error when making individual predictions.
Conclusion
In this example, the input variables were simply not informative enough for the extra flexibility of the ensembles to pay off, so the tree ensemble models performed about the same as the Poisson model for predicting income. With more informative (and nonlinear) input variables, one can expect ensemble models to outperform linear or generalized linear models in terms of predictions on individuals. However, even these more accurate ensemble models can be biased, so they are not guaranteed to estimate important aggregates (grouped sums or conditional means) correctly.
In the next note, we’ll propose a polishing step on uncalibrated models that mitigates this bias, potentially yielding models that are both highly accurate on individuals and correct on certain aggregates.
Originally Posted Here