fbpx
Naive Bayes and Spam Detection Naive Bayes and Spam Detection
In natural language processing, text classification techniques are used to assign a class to a given text.  For example, in spam detection, the classifiers... Naive Bayes and Spam Detection

In natural language processing, text classification techniques are used to assign a class to a given text.  For example, in spam detection, the classifiers decides an email belongs to a spam or non spam (ham) class. Deciding what the topic of a news article is, or whether a movie review is positive or negative, Authorship attribution, gender identification, sentiment analysis, language detection and topic detection are all examples of text classification.

Let’s assume we have a training set of m hand labeled documents. The set of classes (labels) are ( C = {c_1,c_2,…,c_J} ). Our goal is to build a classifier that assigns a class to each document, denoted by (d).

In other words the classifier answers the following question: Given a set of words, what is the probability they belong to a given class?

There are different kinds of text classification techniques. Among them Naive Bayes is the most popular. Naive Bayes is based on the famous Bayes Rule:   

( P(c|d) = frac{P(d|c) P(c)}{P(d)} )  

Naive Bayes tries to maximizes (P(c|d)/) among all available classes. In other words:

(c_{NB} =mathop{arg,max}limits_{c_j in C} P(c_j|d) )

Since (P(d)) is independent of the choice of classes, we can drop it. Therefore the Naive Bayes classifier is   

(c_{NB} =mathop{arg,max}limits_{c_j in C} P(d|c_j)P(c_j) )

Computing /(P(c_j))  is easy as will be explained shortly. To compute (P(d|c_j)), we need to consider all individual words in the document. In other words, assuming there are (n) distinct words (w_1,w_2,…,w_n) in (d), we compute (P(w_1,w_2,…w_n|c)).  We also assume that that these individual words are conditionally independent given the class:

(P(w_1,w_2,…w_n|c) = P(w_1|c) P(w_2|c)…P(w_n|c))

Therefore, the Naive Bayes Classifier can be written as:

(c_{NB} = mathop{arg,max}limits_{c_j in C} P(c_j) prod_{i=1}^n P(w_i|c_j))

Let’s build a classifier for email spam detection using Naive Bayes. Since the classifier relies on historical observations, we need a way to train it. So how will we train the classifier? we will just use the individual words that make up the email message. These words and their association with either spam ((c_1)) or ham ((c_2)) messages will form the basis of our classifier.

For example, suppose we have trained our classifier using 500 email messages,120 are spam and 380 are ham. To apply Naive Bayes, we simply calculate the probability of each individual word given ham or spam in the email documents, along with the probabilities of ham and spam classes and plug them into the above equation.

Suppose that the word money appears in 100 spam emails, and in only 50 ham emails. Suppose also that the total number of words in spam emails is 2500 and 10,000 for ham emails. Thus, we have two classes ham and spam with probabilities, (P(c_1)=P(spam) = frac{120}{500}) and (P(c_2)=P(ham) = frac{380}{500}). To calculate (P(money | spam)), we need to know how often the word money appears relative to the total words in spam emails. Similarly,(P(money | ham)) is the number of times that the word money appears relative to all words in ham emails. So we have (P(money | spam) = frac{100}{2500}) and (P(money | ham) = frac{50}{10000}).    

In practice, the word money may not appear in any spam emails (or ham emails), and hence  (P(money | spam)= 0) and consequently (P(c_j) prod_{i=1}^n P(w_i|c_j) = 0). To avoid such a scenario, we use Laplace (add-one) smoothing technique. That is, we add a “one” to both numerator and denominator when we compute the conditional probabilities for each word. Therefore, for this example, (P(money | spam) = frac{101}{2501}) and (P(money | ham) = frac{51}{10001}).     

Then, given a new unlabeled email, call it (d_{test}), we can predict whether it is ham or spam by computing the product of conditional probabilities for all individual words found in the emails and subsequently compute (P(spam|d_{test}))  and (P(ham|d_{test})). If (P(spam|d_{test}))  is higher than (P(ham|d_{test})), we predict the email to be spam.  

In general, Naive Bayes is fast and robust to ireverant features. Thus, it is used widely as a good reliable baseline for text classification. Python’s machine learning toolkit, Scikit- learn has several implementations of Naive Bayes. More information


©ODSC 2016, Feel free to share + backlink!

1