I like to call linear regression the data scientist’s “workhorse.” It may not be sexy, but it’s a tried and proven technique that can be very useful. When the problem you’re trying to solve requires the prediction of a numeric response variable using multiple continuous (numeric) and/or categorical predictors, then regression is the tool to use. In a nutshell, regression is a method for investigating the functional relationship among variables.
The way to go about regression is straightforward:

1. Draw scatterplots of each predictor variable against the response variable to see whether they appear to be related.
2. Do some manual or automated feature engineering to select the predictors.
3. Fit the model using the linear model function available in your language of choice (e.g. lm() in R) to perform least squares.
4. Use the regression coefficients to plot the regression line.
5. Evaluate the accuracy of the model.
6. Make predictions.
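The fitting step can be sketched in a few lines. Here is a minimal least-squares fit in Python with NumPy; the data is synthetic and the coefficient values are illustrative, not from any real data set:

```python
import numpy as np

# Toy data: response y depends linearly on two predictors (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Add an intercept column and solve the least-squares problem.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(coef)          # approximately [1.5, 2.0, -0.5]
fitted = A @ coef    # predicted values for the training data
```

In R, lm() does the same work (and more) in one call; the point here is only that "fit the model" means solving a least-squares problem for the coefficients.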
In this article, we’ll take a look at the most nuanced step in the process, evaluating accuracy, using a variety of model diagnostics.
If your model is having trouble due to missing values in the data set, try using an imputation process rather than removing the affected observations. R has a number of powerful packages, such as mice and Amelia, for imputing missing values.
Check for collinearity
You should take time to review your data set for multicollinearity, where one predictor can be linearly predicted from the other predictors with a substantial degree of accuracy. Multicollinearity inflates the variance of the coefficient estimates and makes them unstable.
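A common way to quantify this is the variance inflation factor (VIF): regress each predictor on all the others and compute 1 / (1 − R²); values above roughly 5–10 are often treated as a warning sign. A sketch with NumPy on synthetic data (the threshold and data are illustrative):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of predictor matrix X."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1 - resid.var() / target.var()
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)                   # independent predictor
v = vif(np.column_stack([x1, x2, x3]))
print(v)   # x1 and x2 get large VIFs; x3 stays near 1
```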
Check for outliers and influential points
You should take the time to review your data set to make sure there are no major outliers, or data points that exert undue influence on the fitted model.
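Two standard diagnostics here are leverage (the hat values) and Cook’s distance, which combines leverage with residual size. A sketch with NumPy, planting one influential point in otherwise well-behaved synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)
x[0], y[0] = 8.0, -10.0                  # plant one influential outlier

A = np.column_stack([np.ones(len(x)), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

H = A @ np.linalg.inv(A.T @ A) @ A.T     # hat matrix
h = np.diag(H)                           # leverage of each observation
p = A.shape[1]                           # number of fitted parameters
s2 = resid @ resid / (len(y) - p)        # residual variance estimate
cooks = resid**2 / (p * s2) * h / (1 - h)**2

print(np.argmax(cooks))                  # the planted point dominates
```

In R, hatvalues() and cooks.distance() on an lm fit give the same quantities without the manual linear algebra.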
Check for non-linearity
You should check that the variable relationships exhibit linearity, i.e. the response variable has a linear relationship with each of the predictor variables. If your data has non-linear relationships, there is a nice R package called nlstools designed for handling non-linear regression models.
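A quick numeric version of the usual residuals-vs-fitted check: fit a straight line to curved data and the residuals retain structure (here they correlate strongly with the square of the predictor). Illustrative sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.2, size=200)   # truly quadratic response

A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

# Structure left in the residuals signals non-linearity.
r = np.corrcoef(resid, x**2)[0, 1]
print(round(r, 3))   # close to 1: the linear fit missed the curvature
```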
Check for interactions
Make sure you understand interaction effects between predictors: an additional change in the response variable that occurs only with particular combinations of predictor values. Interactions can occur between categorical variables, numeric variables, or both. Two-way interactions, those between exactly two predictors, are the most common.
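Numerically, a two-way interaction is just an extra product column in the design matrix. A sketch on synthetic data with a genuine x1·x2 effect (the coefficient values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 1.5 * x1 * x2 + rng.normal(scale=0.1, size=300)

# Design matrix with an interaction column x1*x2.
A = np.column_stack([np.ones(300), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)   # last entry recovers the interaction effect (about 1.5)
```

In R, the formula syntax y ~ x1 * x2 builds the same design matrix for you.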
Check normality of the residuals
The residuals obtained by the fitting process should be normally distributed. A QQ-plot is a good visual diagnostic for normality. If you find that your residuals are non-normal, one option is to divide the data set into segments that share similar statistical distributions and fit a separate model to each. Non-normality of the residuals may also be due to large outliers; if these turn out to be data errors, you can simply remove them.
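The QQ-plot idea in code: sort the residuals and pair them with the matching normal quantiles; for normal residuals the pairs fall close to a straight line, so their correlation is very close to 1. A sketch using NumPy and the standard library (the residuals here are simulated stand-ins):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(5)
resid = rng.normal(scale=2.0, size=500)   # stand-in for model residuals

n = len(resid)
sample_q = np.sort(resid)
# Theoretical normal quantiles at the plotting positions (i + 0.5) / n.
theo_q = np.array([NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])

# For normal residuals this correlation is very close to 1.
r = np.corrcoef(sample_q, theo_q)[0, 1]
print(round(r, 3))
```

In R, qqnorm() and qqline() on the residuals of an lm fit produce the plot directly.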
Check the R2 value
R2 (also known as the coefficient of determination) is calculated during the fitting process and lies between 0 and 1.0; the closer to 1, the better the fit. Its square root, R, is the multiple correlation coefficient. As you’re doing feature engineering and evaluating subsets of predictors, keep in mind that plain R2 never decreases when you add a predictor, so prefer adjusted R2 when comparing models with different numbers of variables.
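R2 is simply one minus the residual sum of squares over the total sum of squares. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=100)
y = 3 * x + rng.normal(scale=1.0, size=100)

A = np.column_stack([np.ones(100), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

# R2 = 1 - RSS / TSS
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(round(r2, 3))   # high: x explains most of the variance in y
```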
Check the F-test of overall significance
The F-test of overall significance indicates whether the model provides a better fit to the data set than an intercept-only model that contains no predictors. The F statistic ranges from zero to an arbitrarily large number; what matters is its associated p-value: a small p-value means the predictors, taken together, add real explanatory power.
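The statistic compares the fitted model against the intercept-only model: F = ((TSS − RSS)/p) / (RSS/(n − p − 1)) for n observations and p predictors. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=80)
y = 2 * x + rng.normal(size=80)

A = np.column_stack([np.ones(80), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

n, p = 80, 1                                 # n observations, p predictors
rss = resid @ resid                          # residual sum of squares
tss = ((y - y.mean()) ** 2).sum()            # total sum of squares
F = ((tss - rss) / p) / (rss / (n - p - 1))
print(round(F, 1))   # large: the predictor clearly beats intercept-only
```

In R, summary() on an lm fit reports this F statistic and its p-value on the last line.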
Root Mean Squared Error (RMSE)
RMSE is the square root of the mean of the squared residuals, and as such this metric indicates the absolute fit of the model to the data, i.e. how close the observed data points are to the model’s predicted values, expressed in the same units as the response. Smaller values of RMSE indicate better fit.
Make a Prediction
Once you’ve gone through the above diagnostic checklist, you can make predictions with reasonable confidence that they’ll be accurate. But it’s good practice to validate them by (i) reviewing the model’s predictions with a domain expert, (ii) collecting new data and comparing it against the model’s predictions, and (iii) cross-validating: split the data set into two randomly selected samples, use one subset, the “training set,” to estimate model parameters, and use the second subset, the “test set,” to check the predictive performance of your model.
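A minimal train/test split along the lines of point (iii); the 70/30 ratio and the synthetic data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 2))
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=200)

# Shuffle indices, then hold out 30% as the test set.
idx = rng.permutation(200)
train, test = idx[:140], idx[140:]

# Fit on the training set only.
A = np.column_stack([np.ones(200), X])
coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)

# Evaluate on data the model never saw.
test_resid = y[test] - A[test] @ coef
test_rmse = np.sqrt(np.mean(test_resid ** 2))
print(round(test_rmse, 2))   # near the noise scale of 0.2
```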
Regression diagnostics represent the more creative part of this class of supervised learning methods, but if you approach the above list of tips methodically, you’ll find your models will provide more accurate predictions.
To more fully grasp the diagnostics in this article, the reader should investigate the statistical theory and mathematical foundations of regression. A particularly good book for this undertaking is “A Modern Approach to Regression with R,” by Simon J. Sheather.