Identifying Poisonous Mushrooms with Rule Learners

Each year, many people fall ill and sometimes even die from ingesting poisonous wild mushrooms. Since many mushrooms are very similar to each other in appearance, occasionally even experienced mushroom gatherers are poisoned.

If simple, clear, and consistent rules were available for identifying poisonous mushrooms, they could save the lives of foragers.

This article is an excerpt from the book, Machine Learning with R, Third Edition written by Brett Lantz. This book provides a hands-on, readable guide to applying machine learning to real-world problems.

 

Collecting data

To identify rules for distinguishing poisonous mushrooms, we will use the Mushroom dataset compiled by Jeff Schlimmer of Carnegie Mellon University. The raw dataset is available freely at the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml).

The dataset includes information on 8,124 mushroom samples from 23 species of gilled mushrooms listed in the Audubon Society Field Guide to North American Mushrooms (1981).

 

Exploring and preparing the data

We begin by using read.csv() to import the data for our analysis. Since all 22 features and the target class are nominal, we will set stringsAsFactors = TRUE to take advantage of the automatic factor conversion:

> mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)

The output of the str(mushrooms) command notes that the data contains 8,124 observations of 23 variables as the data dictionary had described. While most of the str() output is unremarkable, one feature is worth mentioning. Do you notice anything peculiar about the veil_type variable in the following line?

$ veil_type : Factor w/ 1 level "partial": 1 1 1 1 1 1 ...

If you think it is odd that a factor has only one level, you are correct. The data dictionary lists two levels for this feature, partial and universal; however, all examples in our data are classified as partial. It is likely that this data element was somehow coded incorrectly. In any case, since the veil type does not vary across samples, it does not provide any useful information for prediction. We will drop this variable from our analysis using the following command:

> mushrooms$veil_type <- NULL

By assigning NULL to the veil_type vector, R eliminates the feature from the mushrooms data frame.

Before going much further, we should take a quick look at the distribution of the mushroom type variable in our dataset:

> table(mushrooms$type)

   edible poisonous
     4208      3916

About 52% of the mushroom samples (N = 4,208) are edible, while 48% (N = 3,916) are poisonous.

 

Training a model on the data

If we trained a hypothetical ZeroR classifier on this data, what would it predict? Since ZeroR ignores all of the features and simply predicts the target’s mode, in plain language, its rule would state that “all mushrooms are edible.” Obviously, this is not a very helpful classifier because it would leave a mushroom gatherer sick or dead for nearly half of the mushroom samples! Our rules will need to do much better than this benchmark in order to provide safe advice that can be published. At the same time, we need simple rules that are easy to remember.
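As an illustrative sketch (not code from the book), this ZeroR baseline can be reproduced in a couple of lines of base R by predicting the modal class for every sample:

```r
# Sketch of a ZeroR-style baseline: always predict the most common class.
zero_r_class <- names(which.max(table(mushrooms$type)))   # "edible"
zero_r_preds <- rep(zero_r_class, nrow(mushrooms))
mean(zero_r_preds == mushrooms$type)                      # about 0.518
```

With 4,208 of 8,124 samples edible, this baseline is correct only about 52 percent of the time, which sets the bar our rules must clear.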

Since simple rules can still be useful, let’s see how a very simple rule learner performs on the mushroom data. Toward this end, we will apply the 1R classifier, which will identify the single feature that is the most predictive of the target class and use this feature to construct a rule.
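Before turning to the package, the core idea of 1R can be sketched in base R (this helper is illustrative, not part of the book's code): for each feature, predict the majority class within each of its levels, and keep the feature whose one-feature rule makes the fewest training errors.

```r
# Illustrative sketch of the 1R selection step (not the book's code).
# For one feature, a 1R rule predicts the majority class within each level;
# its training accuracy is the sum of the per-level majority counts.
one_r_accuracy <- function(feature, target) {
  tab <- table(feature, target)
  sum(apply(tab, 1, max)) / length(target)
}

features <- setdiff(names(mushrooms), "type")
accuracies <- sapply(mushrooms[features], one_r_accuracy,
                     target = mushrooms$type)
head(sort(accuracies, decreasing = TRUE))  # odor should rank first
```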

We will use the 1R implementation found in the OneR package by Holger von Jouanne-Diedrich at the Aschaffenburg University of Applied Sciences. It can be installed using the command install.packages("OneR") and loaded by typing library(OneR).


The OneR() function uses the R formula syntax to specify the model to be trained. The formula syntax uses the ~ operator (known as the tilde) to express the relationship between a target variable and its predictors. To include all other variables in the model as predictors, the period character (.) is used. Using the formula type ~ . with OneR() allows our first rule learner to consider all possible features in the mushroom data when predicting mushroom type:

> mushroom_1R <- OneR(type ~ ., data = mushrooms)

To examine the rules it created, we can type the name of the classifier object:

> mushroom_1R

Call:
OneR.formula(formula = type ~ ., data = mushrooms)

Rules:
If odor = almond   then type = edible
If odor = anise    then type = edible
If odor = creosote then type = poisonous
If odor = fishy    then type = poisonous
If odor = foul     then type = poisonous
If odor = musty    then type = poisonous
If odor = none     then type = edible
If odor = pungent  then type = poisonous
If odor = spicy    then type = poisonous

Accuracy:
8004 of 8124 instances classified correctly (98.52%)

Examining the output, we see that the odor feature was selected for rule generation. For the purposes of a field guide for mushroom gathering, these rules could be summarized in a simple rule of thumb: “if the mushroom smells unappetizing, then it is likely to be poisonous.”
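The remaining 8,124 − 8,004 = 120 errors deserve a closer look, since a rule that mistakes a poisonous mushroom for an edible one is far more dangerous than the reverse. One quick way to see the direction of the errors is a confusion matrix built with base R; here we assume, per the OneR package, that predict() on the model object returns the predicted class for each row:

```r
# Cross-tabulate actual versus predicted types to see where the rule fails.
mushroom_pred <- predict(mushroom_1R, mushrooms)
table(actual = mushrooms$type, predicted = mushroom_pred)
```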

 

Conclusion

In this article, we used the 1R classifier to develop rules for identifying poisonous mushrooms. Using a single feature, the 1R algorithm achieved nearly 99 percent accuracy in identifying potentially fatal mushroom samples. Anything short of perfection, however, runs the risk of poisoning someone if the model were to classify a poisonous mushroom as edible. Read more about how to refine and improve these results, along with many other real-world examples like this one, in the acclaimed classic Machine Learning with R, Third Edition, written by Brett Lantz.

Brett Lantz

Brett Lantz (@DataSpelunking) has spent more than 10 years using innovative data methods to understand human behavior. A sociologist by training, Brett was first captivated by machine learning during research on a large database of social network profiles. Brett is a DataCamp instructor and a frequent speaker at machine learning conferences and workshops around the world. He is known to geek out about data science applications for sports, autonomous vehicles, foreign language learning, and fashion, among many other subjects, and hopes to one day blog about these subjects at dataspelunking.com, a website dedicated to sharing knowledge about the search for insight in data.
