Linear regression serves as the data scientist’s workhorse, but this statistical learning method is limited in that the focus of Ordinary Least Squares regression is on linear models of continuous variables. However, much data of interest to data scientists are not continuous and so other methods must be used to create useful predictive models. Generalized linear models (GLM) expands upon linear regression to include non-normal distributions including binary outcome data, count data, probability data, proportion data, and many other data types. In general, a GLM is used for analyzing linear and non-linear effects of continuous and categorical predictor variables on a discrete or continuous response variable. For instance, for predicting a categorical outcome (where the response variable is true/false, 1/0, etc.) it is often advised to use a form of GLM called logistic regression.
There are a number of packages in R that implement GLMs. In fact, there is a useful CRAN vignette that provides a quick analysis of all CRAN packages that have “glm” in their name. It’s fun to explore all the work people around the globe have done under the name of GLM, but for our purposes, we’ll stick with the glm() function in the base R stats package.
Let’s look at a simple example where we’d like to model binary data with a GLM. We can use the well-known “Motor Trend Car Road Test” dataset mtcars available in base R. This data set has 32 observations and 11 variables. The variable vs indicates the type of engine a car has, 0 for V-shaped, and 1 for straight. This will be our response variable. We want to create a model that helps us to predict the probability of a vehicle having a V-shaped engine or a straight engine. In order to do this, we’ll use two predictors, wt (Weight – 1000 lbs) and disp (Displacement – cubic inches).
< # Fit the model with the usual response/predictor
< # formula, and specify a binomal error distribution
< glm1 <- glm(formula= vs ~ wt + disp, data=mtcars,
glm(formula = vs ~ wt + disp, family = binomial, data = mtcars)
Min 1Q Median 3Q Max
-1.67506 -0.28444 -0.08401 0.57281 2.08234
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.60859 2.43903 0.660 0.510
wt 1.62635 1.49068 1.091 0.275
disp -0.03443 0.01536 -2.241 0.025 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.86 on 31 degrees of freedom
Residual deviance: 21.40 on 29 degrees of freedom
Number of Fisher Scoring iterations: 6
We see from the coefficient estimates that wt influences vs positively, while disp has a slightly negative effect. We also see that the coefficient of weight is non-significant (p > 0.05), while the coefficient of displacement is significant (see the asterisk).
We can now use the fitted model to make predictions (we’ll see how good the predictions are as well). We want to predict vs given a weight of 3,000 lbs. and engine displacement of 150 cubic inches.
> # Use fit model to predict vs for weight of 3,000
> # lbs. and displacement of 250
> newdata = data.frame(wt = 3.0, disp = 150)
> predict(glm1, newdata, type=”response”)
The predicted probability is ~ 79%.
Quality of Fit Metrics
Now let’s take a look at some ways to determine the quality of fit for a GLM using various metrics computed by the algorithm.
- Deviance – In the output for the fitted model glm1 we see two important metrics, Null deviance and Residual deviance. Deviance is a quality of fit measurement for a GLM where larger values indicate a poorer fit. The Null deviance shows how well the response variable is predicted by a model that includes only the intercept (grand mean of all the groups). For our example, we have a value of 43.9 on 31 degrees of freedom. Subsequently including the independent variables, wt and disp, serve to decrease the deviance to 21.4 on 29 degrees of freedom, a significant reduction in deviance. Similarly, the Residual deviance has reduced by 22.46 with a loss of two degrees of freedom.
- Fisher Scoring – This metric has to do with how the model was estimated. Fisher scoring uses an iterative approach (the Newton-Raphson algorithm by default). Essentially, the model is fit based on a guess for the estimate values. The algorithm then determines whether the fit would be improved by using different estimates. If so, it moves in that direction (e.g. using a higher value for the estimate) and then fits the model again. The algorithm stops when it feels that moving again would not yield much additional improvement. The line in the GLM output shows how many iterations it took to complete the process (convergence). For glm1 we see that Fisher Scoring needed six iterations to perform the fit.
- The Akaike Information Criterion (AIC) – This is another quality of fit metric that takes into account the ability of the model to fit the data. This is very useful as it provides a method for assessing the quality of a model through comparison of related models. It’s based on the Deviance metric, but penalizes you for making the model more complicated. Much like adjusted R-squared, its intent is to prevent you from including irrelevant predictors. However, unlike adjusted R-squared, the number itself is not meaningful. If you have more than one similar candidate models (where all of the variables of the simpler model occur in the more complex models), then you should select the model that has the smallest AIC. So it’s useful for comparing models, but isn’t interpretable on its own.
- Hosmer-Lemeshow – Another approach for determining binary data quality of fit is to use the Hosmer Lemeshow test (see R code below). The glm1 model appears to fit well because there is no significant difference between the model and the observed data (i.e. the p-value is above 0.05). As with other model fit metrics, we can use this test as just one piece of information in deciding how well the model fits. It doesn’t work well in very large or very small data sets but is often useful nevertheless.
> hoslem.test(mtcars$vs, fitted(glm1))
Hosmer and Lemeshow goodness of fit (GOF) test
data: mtcars$vs, fitted(glm1)
X-squared = 6.4717, df = 8, p-value = 0.5945
To learn more about GLM, attend the workshop at ODSC west: General Training Session: Machine Learning in R Part I