Sometimes when you fit models to test their predictive accuracy, you find that you’re dealing with too many predictors (feature variables). You can draw upon your domain knowledge, or that of an available domain expert, to reduce predictors until you only have those that will offer your model superior accuracy. But if you lack domain knowledge, there are some automated techniques designed to attack the problem: forward and backward elimination.
Feature engineering is the part of the data science process where you try to identify a subset of the available predictors to use in your model. Using your knowledge of the data, you can select and create predictors that make models and machine learning algorithms work better.
Feature engineering is arguably the most underrated part of machine learning. Just because you have a lot of data doesn’t mean all of it has to be used in the model. In fact, some experts say better, well-conceived features are more valuable than algorithms, and some reports say creative feature engineering wins Kaggle competitions.
The process of feature engineering is as much an art as a science. It’s good to have a domain expert around for this process, but it’s also good to use your imagination. Although feature engineering is an ongoing topic of research, let’s review two distinct automated approaches that you can use.
Let’s start with a regression model with no features and then gradually add one feature at a time, according to which feature improves the model the most.
Basically, you build all possible regression models with a single predictor and pick the best one. Then try all possible models that include that best predictor plus a second predictor. Pick the best of those. You keep adding one feature at a time, and you stop when your model no longer improves or starts worsening.
In the R code below, we’ll use the nuclear data set from the boot package. This data set contains “Nuclear power station construction data” with 32 observations and 11 variables. To perform the forward elimination feature engineering technique, we’ll use two R functions iteratively, add1 and update, to perform a series of tests and update the fitted regression model. The goal is to choose the best model for predicting construction cost.
    > library(boot)
    > # Fit model for cost with intercept term only.
    > # This model is insufficient for predicting cost.
    > nuclear_lm0 <- lm(cost~1,data=nuclear)
    > summary(nuclear_lm0)
    > # Start forward elimination
    > # nuclear_lm0 is model we wish to update
    > # scope arg defines most complex model to use for fitting,
    > # namely all predictors.
    > # test="F" for partial F-test to determine accuracy
    > add1(nuclear_lm0, scope=.~.+date+t1+t2+cap+pr+ne+ct+bw+cum.n+pt,test="F")
    Single term additions

    Model:
    cost ~ 1
           Df Sum of Sq    RSS    AIC F value    Pr(>F)
    <none>              897172 329.72
    date    1    334335 562837 316.80 17.8205 0.0002071 ***
    t1      1    186984 710189 324.24  7.8986 0.0086296 **
    t2      1        27 897145 331.72  0.0009 0.9760597
    cap     1    199673 697499 323.66  8.5881 0.0064137 **
    pr      1      9037 888136 331.40  0.3052 0.5847053
    ne      1    128641 768531 326.77  5.0216 0.0325885 *
    ct      1     43042 854130 330.15  1.5118 0.2284221
    bw      1     16205 880967 331.14  0.5519 0.4633402
    cum.n   1     67938 829234 329.20  2.4579 0.1274266
    pt      1    305334 591839 318.41 15.4772 0.0004575 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    > # date predictor offers most improvement in modeling cost,
    > # so update model (could also choose pt)
    > nuclear_lm1 <- update(nuclear_lm0,formula=.~.+date)
    > summary(nuclear_lm1) # Now model includes date
    > # Call add1 again, this time use nuclear_lm1 model
    > # this time cap is most significant
    > add1(nuclear_lm1, scope=.~.+date+t1+t2+cap+pr+ne+ct+bw+cum.n+pt,test="F")
    > # cap predictor needs to be added to model
    > nuclear_lm2 <- update(nuclear_lm1,formula=.~.+cap)
    > summary(nuclear_lm2)
    > # Call add1 again, this time use nuclear_lm2 model
    > add1(nuclear_lm2, scope=.~.+date+t1+t2+cap+pr+ne+ct+bw+cum.n+pt,test="F")
    > # pt predictor needs to be added to model
    > nuclear_lm3 <- update(nuclear_lm2,formula=.~.+pt)
    > summary(nuclear_lm3)
    > # Call add1 again, this time use nuclear_lm3 model
    > add1(nuclear_lm3, scope=.~.+date+t1+t2+cap+pr+ne+ct+bw+cum.n+pt,test="F")
    > # ne predictor needs to be added to model
    > nuclear_lm4 <- update(nuclear_lm3,formula=.~.+ne)
    > summary(nuclear_lm4)
    > # Call add1 again, this time use nuclear_lm4 model
    > # No more predictors would add significance in improvement
    > # of model.
    > add1(nuclear_lm4, scope=.~.+date+t1+t2+cap+pr+ne+ct+bw+cum.n+pt,test="F")
    > # Final model
    > summary(nuclear_lm4)

    Call:
    lm(formula = cost ~ date + t2 + cap + pr + ne + cum.n, data = nuclear)

    Residuals:
         Min       1Q   Median       3Q      Max
    -152.851  -53.929   -8.827   53.382  155.581

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -9.702e+03  1.294e+03  -7.495 7.55e-08 ***
    date         1.396e+02  1.843e+01   7.574 6.27e-08 ***
    t2           4.905e+00  1.827e+00   2.685 0.012685 *
    cap          4.137e-01  8.425e-02   4.911 4.70e-05 ***
    pr          -8.851e+01  3.479e+01  -2.544 0.017499 *
    ne           1.502e+02  3.400e+01   4.419 0.000168 ***
    cum.n       -7.919e+00  2.871e+00  -2.758 0.010703 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 80.8 on 25 degrees of freedom
    Multiple R-squared:  0.8181, Adjusted R-squared:  0.7744
    F-statistic: 18.74 on 6 and 25 DF,  p-value: 3.796e-08
Now the linear model summary for nuclear_lm4 shows each variable is significant in predicting cost (per the asterisks in the right-most column). This model should work well in predicting construction cost.
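The repeated add1 and update calls above follow a fixed pattern, so they can be wrapped in a loop. The following is only a sketch under stated assumptions: the helper name forward_select, the candidates vector, and the 0.05 significance cutoff are illustrative choices, not part of the walkthrough above, which picked each predictor by inspecting the output by hand.

```r
library(boot)  # provides the nuclear data set

# Hypothetical helper: repeat add1()/update() until no remaining candidate
# passes the partial F-test at the chosen cutoff (alpha = 0.05 is an assumption).
forward_select <- function(model, candidates, alpha = 0.05) {
  repeat {
    # Candidates not yet in the model
    remaining <- setdiff(candidates, attr(terms(model), "term.labels"))
    if (length(remaining) == 0) break
    tests <- add1(model, scope = reformulate(remaining), test = "F")
    pvals <- tests[["Pr(>F)"]]
    names(pvals) <- rownames(tests)
    pvals <- pvals[!is.na(pvals)]          # drops the <none> row
    if (length(pvals) == 0 || min(pvals) >= alpha) break
    best <- names(which.min(pvals))        # most significant addition
    model <- update(model, as.formula(paste(". ~ . +", best)))
  }
  model
}

candidates <- c("date", "t1", "t2", "cap", "pr", "ne", "ct", "bw", "cum.n", "pt")
nuclear_lm0 <- lm(cost ~ 1, data = nuclear)
nuclear_fwd <- forward_select(nuclear_lm0, candidates)
summary(nuclear_fwd)
```

Choosing the smallest p-value at each step mirrors the manual procedure, but a fixed cutoff is a simplification; inspecting each add1 table yourself lets you weigh near-ties such as date versus pt.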
A similar approach involves backward elimination of features. Here, you commence with a regression model that includes a full set of predictors, and you gradually remove one at a time according to the predictor whose removal makes the biggest improvement. You stop removing predictors when the removal makes the predictive model worsen.
To perform the backward elimination feature engineering technique, you can use two R functions iteratively, drop1 and update, to perform a series of tests and update the fitted regression model. From the output of drop1, you choose the variable whose removal reduces the goodness of fit the least, that is, the least significant one, and remove it from the model.
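As a rough illustration of the drop1 and update procedure just described, here is a minimal sketch on the same nuclear data. The helper name backward_eliminate and the 0.05 cutoff used as the stopping rule are assumptions made for this example; in practice you might inspect each drop1 table by hand instead.

```r
library(boot)  # provides the nuclear data set

# Start from the full model with all ten predictors.
nuclear_full <- lm(cost ~ date + t1 + t2 + cap + pr + ne + ct + bw + cum.n + pt,
                   data = nuclear)

# Hypothetical helper: repeatedly drop the least significant predictor until
# every remaining one passes the partial F-test (alpha = 0.05 is an assumption).
backward_eliminate <- function(model, alpha = 0.05) {
  repeat {
    if (length(attr(terms(model), "term.labels")) == 0) break
    tests <- drop1(model, test = "F")
    pvals <- tests[["Pr(>F)"]]
    names(pvals) <- rownames(tests)
    pvals <- pvals[!is.na(pvals)]          # drops the <none> row
    if (length(pvals) == 0 || max(pvals) < alpha) break
    worst <- names(which.max(pvals))       # least significant predictor
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

nuclear_bwd <- backward_eliminate(nuclear_full)
summary(nuclear_bwd)
```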
You may find that the final model for forward elimination is different from the final model for backward elimination. This is because predictors in a model affect each other: the estimated coefficients of the predictors at play change as you control for different variables. Automated feature engineering techniques are capricious in nature, despite the methodical way they’re applied.
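One quick way to check whether the two directions disagree is to run both and compare the selected terms. The sketch below uses R's built-in step function, which is a different selection rule from the partial F-tests discussed in this article: step compares models by AIC. Treat it as an illustration of the comparison, not a reproduction of the add1 and drop1 procedure.

```r
library(boot)  # provides the nuclear data set

null_model <- lm(cost ~ 1, data = nuclear)
full_model <- lm(cost ~ date + t1 + t2 + cap + pr + ne + ct + bw + cum.n + pt,
                 data = nuclear)

# AIC-based search in each direction; trace = 0 suppresses step-by-step output.
fwd <- step(null_model, scope = formula(full_model),
            direction = "forward", trace = 0)
bwd <- step(full_model, direction = "backward", trace = 0)

fwd_terms <- attr(terms(fwd), "term.labels")
bwd_terms <- attr(terms(bwd), "term.labels")

# Predictors chosen by one direction but not the other
setdiff(fwd_terms, bwd_terms)
setdiff(bwd_terms, fwd_terms)
```

If both setdiff calls return empty vectors, the two searches agree on this data; any names they return are predictors whose usefulness depends on which other variables are already in the model.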
If you want to learn more about feature engineering, the ODSC West conference in San Francisco will feature a workshop on machine learning vs. feature engineering and how to focus them to predict consumer behavior.