Feature Selection Using Genetic Algorithms in R

ModelingFeature selectionMachine Learningposted by Pablo Casas February 4, 2019

This is a post about feature selection using genetic algorithms in R, in which we will do a quick review about:...

This is a post about feature selection using genetic algorithms in R, in which we will do a quick review about:

• What are genetic algorithms?
• GA in ML?
• What does a solution look like?
• GA process and its operators
• The fitness function
• Genetics Algorithms in R!
• Try it yourself
• Relating concepts

The intuition behind

Imagine a black box which can help us to decide an unlimited number of possibilities, with a criterion such that we can find an acceptable solution (both in time and quality) to a problem that we formulate.

What are genetic algorithms?

Genetic Algorithms (GA) are a mathematical model inspired by Charles Darwin’s idea of natural selection.

The natural selection preserves only the fittest individuals over the different generations.

Imagine a population of 100 rabbits in 1900, if we look the population today, we see that rabbits are faster and more skillful at finding food than their ancestors.

GA in ML

In machine learning, one of the uses of genetic algorithms is to pick up the right number of variables in order to create a predictive model.

To pick up the right subset of variables is a problem of combinatory and optimization.

The advantage of this technique over others is, it allows the best solution to emerge from the best of prior solutions. It’s an evolutionary algorithm which improves the selection over time.

The idea of GA is to combine the different solutions generation after generation to extract the best genes (variables) from each one.

We can find other uses of GA such as hyper-tunning parameters, finding the maximum (or minimum) of a function or the search for a correct neural network architecture (Neuroevolution), and more.

GA in feature selection

Every possible solution of the GA, which are the selected variables (a single 🐇), are considered as a whole. It will not rank variables individually against the target.

And this is important because we already know that variables work in group.

What does a solution look like?

Keeping it simple for the example, imagine we have a total of 6 variables,

One solution can be picking up 3 variables, let’s say: `var2``var4`, and `var5`.

Another solution can be: `var1` and `var5`.

These solutions are the so-called individuals or chromosomes in a population. They are possible solutions to our problem.

Credit image: Vijini Mallawaarachchi

From the image, solution 3 can be expressed as a one-hot vector: `c(1,0,1,0,1,1)`. Each `1` indicates the solution containg that variable. In this case: `var1``var3``var5``var6`.

While solution 4 is: `c(1,1,0,1,1,0)`.

Each position in the vector is a gene.

GA process and its operators

The underlying idea of a GA is to generate some random possible solutions (called `population`), which represent different variables, to then combine the best solutions in an iterative process.

This combination follows the basic GA operations, which are: selection, mutation and cross-over.

• Selection: Pick up the most fitted individuals in a generation (i.e.: the solutions providing the highest ROC).
• Cross-over: Create 2 new individuals, based on the genes of two solutions. These children will appear in the next generation.
• Mutation: Change a gene randomly in the individual (i.e.: flip a `0` to `1`)

The idea is for each generation, we will find better individuals, like a fast rabbit.

I recommend the post of Vijini Mallawaarachchi about how a genetic algorithm works.

These basic operations allow the algorithm to change the possible solutions by combining them in a way that maximizes the objective.

The fitness function

This objective maximization is, for example, to keep with the solution that maximizes the area under the ROC curve. This is defined in the fitness function.

The fitness function takes a possible solution (or chromosome, if you want to sound more sophisticated), and somehow evaluates the effectiveness of the selection.

Normally, the fitness function takes the one-hot vector `c(1,1,0,0,0,0)`, creates, for example, a random forest model with `var1` and `var2`, and returns the fitness value (ROC).

The fitness value in this code calculates is: `ROC value / number of variables`. By doing this the algorithm penalizes the solutions with a large number of variables. Similar to the idea of Akaike information criterion, or AIC.

Genetics Algorithms in R! 🐛

My intention is to provide you with a clean code so you can understand what’s behind, while at the same time, try new approaches like modifying the fitness function. This is a crucial point.

To use on your own data set, make sure `data_x` (data frame) and `data_y` (factor) are compatible with the `custom_fitness` function.

The main library is `GA`, developed by Luca Scrucca. See here the vignette with examples.

📣 Important: The following code is incomplete. Clone the repository to run the example.

``````# data_x: input data frame
# data_y: target variable (factor)

# GA parameters
param_nBits=ncol(data_x)
col_names=colnames(data_x)

# Executing the GA
ga_GA_1 = ga(fitness = function(vars) custom_fitness(vars = vars,
data_x =  data_x,
data_y = data_y,
p_sampling = 0.7), # custom fitness function
type = "binary", # optimization data type
crossover=gabin_uCrossover,  # cross-over method
elitism = 3, # best N indiv. to pass to next iteration
pmutation = 0.03, # mutation rate prob
popSize = 50, # the number of indivduals/solutions
nBits = param_nBits, # total number of variables
names=col_names, # variable name
run=5, # max iter without improvement (stopping criteria)
maxiter = 50, # total runs or generations
monitor=plot, # plot the result at each iteration
keepBest = TRUE, # keep the best solution at the end
parallel = T, # allow parallel procesing
seed=84211 # for reproducibility purposes
)
``````
``````# Checking the results
summary(ga_GA_1)
``````
``````── Genetic Algorithm ───────────────────

GA settings:
Type                  =  binary
Population size       =  50
Number of generations =  50
Elitism               =  3
Crossover probability =  0.8
Mutation probability  =  0.03

GA results:
Iterations             = 17
Fitness function value = 0.2477393
Solution =
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean
[1,]           0            1              0         0               0                1
concavity_mean concave points_mean symmetry_mean fractal_dimension_mean  ...
[1,]              0                   0             0                      0
symmetry_worst fractal_dimension_worst
[1,]              0                       0
``````
``````# Following line will return the variable names of the final and best solution
best_vars_ga=col_names[ga_GA_1@solution[1,]==1]

# Checking the variables of the best solution...
best_vars_ga
``````
``````[1] "texture_mean"     "compactness_mean" "area_worst"       "concavity_worst"
``````

• Blue dot: Population fitness average
• Green dot: Best fitness value

Note: Don’t expect the result that fast 😅

Now we calculate the accuracy based on the best selection!

``````get_accuracy_metric(data_tr_sample = data_x, target = data_y, best_vars_ga)
``````
``````[1] 0.9508279
``````

The accuracy is around 95,08%, while the ROC value is closed to 0,95 (ROC=fitness value * number of variables, check the fitness function).

Analyzing the results

I don’t like to analyze the accuracy without the cutpoint (Scoring Data), but it’s useful to compare with the results of this Kaggle post.

He got a similar accuracy result using recursive feature elimination, or RFE, based on 5 variables, while our solution stays with 4.

Try it yourself

Try a new fitness function, some solutions still provide a large number of variables, you can try squaring the number of variables.

Another thing to try is the algorithm to get the ROC value, or even to change the metric.

Some configurations last a lot of time. Balance classes before modeling and play with the `p_sampling` parameter. Sampling techniques can have a big impact on models. Check the Sample size and class balance on model performance post for more info.

How about changing the rate of mutation or elitism? Or trying other cross-over methods?

Increase the `popSize` to test more possible solutions at the same time (at a time cost).

Feel free to share any insights or ideas to improve the selection.

Clone the repository to run the example.

Relating concepts

There is a parallelism between GA and Deep Learning, the concept of iteration and improvement over time is similar.

I added the `p_sampling` parameter to speed up things. And it usually accomplishes its goal. Similar to the batch concept used in Deep Learning. Another parallel is between the GA parameter `run` and the early stopping criteria in the neural network training.

But the biggest similarity is both techniques come from observing the nature. In both cases, humans observed how neural networks and genetics work, and create a simplified mathematical model that imitate their behavior. Nature has millions of years of evolution, why not try to imitate it? 🌱

I tried to be brief about GA, but if you have any specific question on this vast topic, please leave it in the comments 🙋 🙋‍♂

And, if I didn’t motivate you the enough to study GA, check this project which is based on Neuroevolution:

Original post here.

Pablo Casas

I've been in touch with data for the last 10 years, working and playing with data in different areas, either for business or R&D. I'm graduated from Information System Engineering (Universidad Tecnológica Nacional - Argentina). Nowadays I'm working as Machine Learning Specialist in Auth0.com, developing deep learning user behavior models and predictive modeling for marketing and sales. I've passion for teaching all the concepts I learned using gentle examples, helping them not to get bogged down by complex issues. I wrote the Data Science Live Book (DSLB) -open source- which addresses the not-so-popular but highly needed tasks in a data project, such as exploratory data analysis and data preparation for machine learning. Backed by the reader's intuition and logic, the DSLB introduces in a gentle way different concepts and R code receipts ready to be used in real-world problems.

1