Feature Engineering with Forward and Backward Elimination

Sometimes when you fit models to test their predictive accuracy, you find that you’re dealing with too many predictors (feature variables). You can draw upon your domain knowledge, or that of an available domain expert, to reduce predictors until you only have those that will offer your model superior accuracy. But if you lack domain knowledge, there are some automated techniques designed to attack the problem: forward and backward elimination.

Feature engineering is the part of the data science process where you try to identify a subset of the available predictors to use in your model. Using your knowledge of the data, you can select and create predictors that make models and machine learning algorithms work better.

Feature engineering is arguably the most underrated part of machine learning. Just because you have a lot of data doesn’t mean all of it has to be used in the model. In fact, some experts say better, well-conceived features are more valuable than algorithms, and some reports say creative feature engineering wins Kaggle competitions. 

The process of feature engineering is as much an art as a science. It’s good to have a domain expert around for this process, but it’s also good to use your imagination. Although feature engineering is an ongoing topic of research, let’s review two distinct automated approaches that you can use.

Forward Elimination

Let’s start with a regression model with no features and then gradually add one feature at a time, according to which feature improves the model the most.

Basically, you build all possible regression models with a single predictor and pick the best one. Then try all possible models that include that best predictor plus a second predictor. Pick the best of those. You keep adding one feature at a time, and you stop when your model no longer improves or starts worsening.

In the R code below we’ll use the nuclear data set from the boot package. This data set contains “Nuclear power station construction data” with 32 observations and 11 variables. To perform forward elimination, we’ll use two R functions, add1 and update, iteratively: add1 runs a series of tests and update refits the regression model. The goal is to choose the best model for predicting construction cost.

> library(boot)

> # Fit model for cost with intercept term only.
> # This model is insufficient for predicting cost. 
> nuclear_lm0 <- lm(cost~1,data=nuclear)
> summary(nuclear_lm0)

> # Start forward elimination
> # nuclear_lm0 is model we wish to update
> # scope arg defines most complex model to use for fitting,
> # namely all predictors.
> # test="F" requests a partial F-test for each candidate term
> add1(nuclear_lm0,
+      scope=.~.+date+t1+t2+cap+pr+ne+ct+bw+cum.n+pt, test="F")
Single term additions

Model:
cost ~ 1
        Df Sum of Sq    RSS    AIC F value    Pr(>F)    
<none>              897172 329.72                       
date     1    334335 562837 316.80 17.8205 0.0002071 ***
t1       1    186984 710189 324.24  7.8986 0.0086296 ** 
t2       1        27 897145 331.72  0.0009 0.9760597    
cap      1    199673 697499 323.66  8.5881 0.0064137 ** 
pr       1      9037 888136 331.40  0.3052 0.5847053    
ne       1    128641 768531 326.77  5.0216 0.0325885 *  
ct       1     43042 854130 330.15  1.5118 0.2284221    
bw       1     16205 880967 331.14  0.5519 0.4633402    
cum.n    1     67938 829234 329.20  2.4579 0.1274266    
pt       1    305334 591839 318.41 15.4772 0.0004575 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> # date predictor offers most improvement in modeling cost, 
> # so update model (could also choose pt)
> nuclear_lm1 <- update(nuclear_lm0,formula=.~.+date)
> summary(nuclear_lm1)    # Now model includes date

> # Call add1 again, this time on the nuclear_lm1 model.
> # Now cap is the most significant addition.
> add1(nuclear_lm1,
+      scope=.~.+date+t1+t2+cap+pr+ne+ct+bw+cum.n+pt, test="F")

> # cap predictor needs to be added to model
> nuclear_lm2 <- update(nuclear_lm1,formula=.~.+cap)
> summary(nuclear_lm2)

> # Call add1 again, this time on the nuclear_lm2 model
> add1(nuclear_lm2,
+      scope=.~.+date+t1+t2+cap+pr+ne+ct+bw+cum.n+pt, test="F")

> # pt predictor needs to be added to model
> nuclear_lm3 <- update(nuclear_lm2,formula=.~.+pt)
> summary(nuclear_lm3)

> # Call add1 again, this time on the nuclear_lm3 model
> add1(nuclear_lm3,
+      scope=.~.+date+t1+t2+cap+pr+ne+ct+bw+cum.n+pt, test="F")

> # ne predictor needs to be added to model
> nuclear_lm4 <- update(nuclear_lm3,formula=.~.+ne)
> summary(nuclear_lm4)

> # Call add1 again, this time on the nuclear_lm4 model.
> # No remaining predictor adds a significant improvement
> # to the model.
> add1(nuclear_lm4,
+      scope=.~.+date+t1+t2+cap+pr+ne+ct+bw+cum.n+pt, test="F")

> # Final model
> summary(nuclear_lm4)      
Call:
lm(formula = cost ~ date + t2 + cap + pr + ne + cum.n, data = nuclear)

Residuals:
     Min       1Q   Median       3Q      Max 
-152.851  -53.929   -8.827   53.382  155.581 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.702e+03  1.294e+03  -7.495 7.55e-08 ***
date         1.396e+02  1.843e+01   7.574 6.27e-08 ***
t2           4.905e+00  1.827e+00   2.685 0.012685 *  
cap          4.137e-01  8.425e-02   4.911 4.70e-05 ***
pr          -8.851e+01  3.479e+01  -2.544 0.017499 *  
ne           1.502e+02  3.400e+01   4.419 0.000168 ***
cum.n       -7.919e+00  2.871e+00  -2.758 0.010703 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 80.8 on 25 degrees of freedom
Multiple R-squared:  0.8181, Adjusted R-squared:  0.7744 
F-statistic: 18.74 on 6 and 25 DF,  p-value: 3.796e-08

Now the linear model summary for nuclear_lm4 shows each variable is significant in predicting cost (per the asterisks in the right-most column). This model should work well in predicting construction cost.

Backward Elimination

A similar approach involves backward elimination of features. Here, you start with a regression model that includes the full set of predictors and gradually remove one at a time, always dropping the predictor whose removal improves the model the most (or harms it the least). You stop removing predictors when removing any more would make the model worse.

To perform backward elimination, you can use two R functions, drop1 and update, iteratively: drop1 runs a series of tests and update refits the regression model. From the output of drop1, you remove the variable whose omission reduces the goodness of fit the least.
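One backward-elimination pass on the same nuclear data might be sketched as follows. Which predictor drop1 flags first depends on the actual output, so the choice of bw below is only a placeholder:

```r
library(boot)

# Start from the full model containing all ten predictors.
nuclear_full <- lm(cost ~ date + t1 + t2 + cap + pr + ne + ct + bw + cum.n + pt,
                   data = nuclear)

# drop1 refits the model once per predictor, testing each
# single-term removal with a partial F-test.
drop1(nuclear_full, test = "F")

# Remove the predictor with the highest (least significant) p-value
# and refit -- bw here stands in for whichever term drop1 flags.
nuclear_back1 <- update(nuclear_full, formula = . ~ . - bw)
drop1(nuclear_back1, test = "F")

# Repeat until removing any remaining predictor would
# significantly worsen the fit.
```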

You may find that the final model from forward elimination differs from the final model from backward elimination. This is because predictors in a model affect each other: the estimated coefficients of the predictors change as you control for different variables. Automated feature engineering techniques are capricious in nature, despite the methodical way they’re applied.
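Base R’s step function automates the add1/drop1 loops shown above, though it selects terms by AIC rather than partial F-tests, so its final model may differ from the hand-run versions. A minimal sketch of both directions:

```r
library(boot)

nuclear_lm0  <- lm(cost ~ 1, data = nuclear)
nuclear_full <- lm(cost ~ date + t1 + t2 + cap + pr + ne + ct + bw + cum.n + pt,
                   data = nuclear)

# Forward selection from the intercept-only model (AIC-based).
fwd <- step(nuclear_lm0, scope = formula(nuclear_full),
            direction = "forward", trace = 0)

# Backward elimination from the full model (AIC-based).
bwd <- step(nuclear_full, direction = "backward", trace = 0)

# Compare the formulas each direction settles on.
formula(fwd)
formula(bwd)
```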

If you want to learn more about feature engineering, the ODSC West conference in San Francisco will feature a workshop on machine learning vs. feature engineering and how to focus them to predict consumer behavior.


Ready to learn more data science skills and techniques in-person? Register for ODSC West this October 31 – November 3 now and hear from world-renowned names in data science and artificial intelligence!

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who’s been working with data long before the field came in vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator having taught data science, machine learning and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.
