Each year, many people fall ill and sometimes even die from ingesting poisonous wild mushrooms. Since many mushrooms are very similar to each other in appearance, occasionally even experienced mushroom gatherers are poisoned.
If simple, clear, and consistent rules were available for identifying poisonous mushrooms, they could save the lives of foragers.
This article is an excerpt from the book Machine Learning with R, Third Edition, written by Brett Lantz. This book provides a hands-on, readable guide to applying machine learning to real-world problems.
To identify rules for distinguishing poisonous mushrooms, we will use the Mushroom dataset by Jeff Schlimmer of Carnegie Mellon University. The raw dataset is freely available at the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml).
The dataset includes information on 8,124 mushroom samples from 23 species of gilled mushrooms listed in the Audubon Society Field Guide to North American Mushrooms (1981).
Exploring and preparing the data
We begin by using read.csv() to import the data for our analysis. Since all 22 features and the target class are nominal, we will set stringsAsFactors = TRUE to take advantage of the automatic factor conversion:
> mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)
The output of the str(mushrooms) command notes that the data contains 8,124 observations of 23 variables as the data dictionary had described. While most of the str() output is unremarkable, one feature is worth mentioning. Do you notice anything peculiar about the veil_type variable in the following line?
$ veil_type : Factor w/ 1 level "partial": 1 1 1 1 1 1 ...
If you think it is odd that a factor has only one level, you are correct. The data dictionary lists two levels for this feature, partial and universal; however, all examples in our data are classified as partial. It is likely that this data element was somehow coded incorrectly. In any case, since the veil type does not vary across samples, it does not provide any useful information for prediction. We will drop this variable from our analysis using the following command:
> mushrooms$veil_type <- NULL
Assigning NULL to the veil_type vector tells R to eliminate the feature from the mushrooms data frame.
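The same check can be done programmatically rather than by reading the str() output. The following is a minimal sketch, assuming a hypothetical four-row sample in place of the real mushrooms.csv file; the names csv_text, toy, level_counts, and constant_cols are illustrative and do not come from the book.

```r
# Hypothetical four-row sample; the real mushrooms.csv has 8,124 rows.
csv_text <- "type,odor,veil_type
edible,almond,partial
poisonous,foul,partial
edible,none,partial
poisonous,pungent,partial"

# stringsAsFactors = TRUE converts every text column to a factor.
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Count the levels of each column (all columns are factors here).
level_counts <- sapply(toy, nlevels)
print(level_counts)

# A factor with a single level cannot help separate the classes.
constant_cols <- names(level_counts)[level_counts < 2]

# Keep only the columns that vary across samples.
toy <- toy[, setdiff(names(toy), constant_cols), drop = FALSE]
str(toy)
```

On this toy sample, only veil_type is flagged and removed, which mirrors the manual step above.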
Before going much further, we should take a quick look at the distribution of the mushroom type variable in our dataset.
About 52% of the mushroom samples (N = 4,208) are edible, while 48% (N = 3,916) are poisonous.
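One way to produce such a summary is with table() and prop.table(). The sketch below rebuilds the class vector from the counts reported above so that it runs without the data file; with the real data, table(mushrooms$type) would presumably take the place of the reconstructed type vector.

```r
# Rebuild the class vector from the counts reported in the article.
type <- factor(c(rep("edible", 4208), rep("poisonous", 3916)))

# Absolute counts per class.
class_counts <- table(type)
print(class_counts)

# Share of each class, as a rounded percentage.
class_share <- round(prop.table(class_counts) * 100)
print(class_share)
```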
Training a model on the data
If we trained a hypothetical ZeroR classifier on this data, what would it predict? Since ZeroR ignores all of the features and simply predicts the target’s mode, in plain language, its rule would state that “all mushrooms are edible.” Obviously, this is not a very helpful classifier because it would leave a mushroom gatherer sick or dead for nearly half of the mushroom samples! Our rules will need to do much better than this benchmark in order to provide safe advice that can be published. At the same time, we need simple rules that are easy to remember.
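To make the benchmark concrete, here is a minimal base R sketch of what a ZeroR-style classifier would do. It is not taken from any package; the class vector is rebuilt from the counts reported earlier, and the variable names are illustrative.

```r
# Rebuild the class vector from the counts reported in the article.
type <- factor(c(rep("edible", 4208), rep("poisonous", 3916)))

# ZeroR: ignore all features and always predict the most common class.
zero_r_class <- names(which.max(table(type)))
zero_r_pred <- factor(rep(zero_r_class, length(type)), levels = levels(type))

# Overall accuracy of the constant prediction.
accuracy <- mean(zero_r_pred == type)

# The dangerous errors: poisonous samples that would be labeled edible.
missed_poisonous <- sum(type == "poisonous" & zero_r_pred == "edible")

cat("ZeroR predicts:", zero_r_class, "\n")
cat("Accuracy:", round(accuracy * 100, 1), "%\n")
cat("Poisonous samples labeled edible:", missed_poisonous, "\n")
```

The constant rule is right about 51.8% of the time, and every one of the 3,916 poisonous samples is labeled edible.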
Since simple rules can still be useful, let’s see how a very simple rule learner performs on the mushroom data. Toward this end, we will apply the 1R classifier, which will identify the single feature that is the most predictive of the target class and use this feature to construct a rule.
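To show the idea behind 1R, the following is a rough base R sketch: for each feature, map every level to its majority class, count the resulting errors, and keep the feature with the fewest errors. The toy data frame and the helper one_feature_rule() are hypothetical, and ties are broken by taking the first class, which may differ from how a packaged implementation handles them.

```r
# Hypothetical toy data; the feature values are for illustration only.
toy <- data.frame(
  odor      = c("almond", "foul", "none", "foul", "almond", "none"),
  cap_color = c("brown", "brown", "white", "white", "brown", "white"),
  type      = c("edible", "poisonous", "edible",
                "poisonous", "edible", "poisonous"),
  stringsAsFactors = TRUE
)

# For one feature, map each level to its majority class and count errors.
one_feature_rule <- function(feature, target) {
  counts <- table(feature, target)
  # Majority class per feature level (first class wins a tie).
  majority <- colnames(counts)[apply(counts, 1, which.max)]
  names(majority) <- rownames(counts)
  # Every sample outside its level's majority class is an error.
  errors <- sum(counts) - sum(apply(counts, 1, max))
  list(rule = majority, errors = errors)
}

# Build a candidate rule set for every feature except the target.
features <- setdiff(names(toy), "type")
rules <- lapply(toy[features], one_feature_rule, target = toy$type)

# 1R keeps the single feature whose rules make the fewest errors.
errors <- sapply(rules, function(r) r$errors)
best <- names(which.min(errors))

print(errors)
print(rules[[best]]$rule)
```

On this toy data, odor makes one error and cap_color makes two, so odor is selected.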
We will use the 1R implementation found in the OneR package by Holger von Jouanne-Diedrich at the Aschaffenburg University of Applied Sciences. It can be installed using the command install.packages("OneR") and loaded by typing library(OneR).
The OneR() function uses the R formula syntax to specify the model to be trained. The formula syntax uses the ~ operator (known as the tilde) to express the relationship between a target variable and its predictors. To include all of the variables in the model, the period character is used. Using the formula type ~ . with OneR() allows our first rule learner to consider all possible features in the mushroom data when predicting mushroom type:
> mushroom_1R <- OneR(type ~ ., data = mushrooms)
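To see what the period in a formula stands for, base R's terms() function can expand it against a data frame. This is a small sketch using a hypothetical two-row data frame named toy; it illustrates the formula syntax only and does not call OneR().

```r
# Hypothetical two-row data frame with one target and two features.
toy <- data.frame(
  type      = factor(c("edible", "poisonous")),
  odor      = factor(c("almond", "foul")),
  cap_color = factor(c("brown", "white"))
)

# Expand the period: every column except the target becomes a predictor.
model_terms <- terms(type ~ ., data = toy)
predictors <- attr(model_terms, "term.labels")
print(predictors)

# The same predictors written out explicitly.
explicit <- attr(terms(type ~ odor + cap_color), "term.labels")
print(identical(predictors, explicit))
```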
To examine the rules that OneR() created, we can type the name of the classifier object:
> mushroom_1R
OneR.formula(formula = type ~ ., data = mushrooms)
If odor = almond then type = edible
If odor = anise then type = edible
If odor = creosote then type = poisonous
If odor = fishy then type = poisonous
If odor = foul then type = poisonous
If odor = musty then type = poisonous
If odor = none then type = edible
If odor = pungent then type = poisonous
If odor = spicy then type = poisonous
8004 of 8124 instances classified correctly (98.52%)
Examining the output, we see that the odor feature was selected for rule generation. For the purposes of a field guide for mushroom gathering, these rules could be summarized in a simple rule of thumb: “if the mushroom smells unappetizing, then it is likely to be poisonous.”
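The accuracy figure in the output is simple arithmetic, and it is worth looking at the number of errors rather than only the percentage. A short sketch using the counts printed above:

```r
# Counts taken from the 1R output shown above.
correct <- 8004
total <- 8124

# Accuracy as a percentage, and the number of misclassified samples.
accuracy_1r <- round(correct / total * 100, 2)
errors_1r <- total - correct

cat("Accuracy:", accuracy_1r, "%\n")
cat("Misclassified samples:", errors_1r, "\n")
```

The rules classify 98.52% of the samples correctly, which still leaves 120 mushrooms with the wrong label.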
In this article, we used the 1R classifier to develop rules for identifying poisonous mushrooms. The 1R algorithm used a single feature, odor, to achieve nearly 99% accuracy (98.52%) in classifying the mushroom samples. Anything short of perfection, however, runs the risk of poisoning someone if the model were to classify a poisonous mushroom as edible. Read more about how to refine and improve the results, along with many other real-world examples like this one, in the latest edition of the acclaimed classic Machine Learning with R, Third Edition, written by Brett Lantz.