The Naïve Bayes algorithm is a simple yet powerful probabilistic classifier based on Bayes’ Theorem. It belongs to the family of Bayesian classifiers, which use probability theory to predict the class of a given input based on prior knowledge and observed data.
Definition
- Prior probability: The initial probability of an event (A) before seeing any new evidence.
- Posterior probability: The probability of the hypothesis (A) after seeing the evidence (B). In simpler words, the updated probability of an event after taking new evidence (B) into account.
Bayes Theorem
Given a hypothesis $H$ (class) and evidence $E$ (data) for this hypothesis, the probability of $H$ given $E$ is:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

Where:
- $P(H \mid E)$ = Posterior probability (what we want: probability of a class given the data)
- $P(E \mid H)$ = Likelihood (how likely this data is if we assume the class)
- $P(H)$ = Prior probability (how likely the class is in general)
- $P(E)$ = Evidence (overall probability of seeing this data)
Example - Medical diagnosis
Given: A doctor knows that:
- Meningitis causes stiff neck 50% of the time
- Prior probability of any patient having meningitis is 1/50 000
- Prior probability of any patient having stiff neck is 1/20
Question: If a patient has a stiff neck, what is the probability that he has meningitis?
Answer
We annotate the question:
Given: A doctor knows that:
- Meningitis causes stiff neck 50% of the time ($P(S \mid M) = 0.5$)
- Prior probability of any patient having meningitis is 1/50 000 ($P(M) = 1/50\,000$)
- Prior probability of any patient having stiff neck is 1/20 ($P(S) = 1/20$)
Question: If a patient has a stiff neck, what is the probability that he has meningitis? ($P(M \mid S) = \,?$)
We use Bayes' Theorem to solve the question:

$$P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)} = \frac{0.5 \times 1/50\,000}{1/20} = 0.0002$$

The patient has a 0.0002 probability of having meningitis given that he has a stiff neck.
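As a quick sanity check of this arithmetic, here is a minimal Python sketch (the helper function and variable names are my own, not from the lecture notes):

```python
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Bayes' Theorem: P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence

# Meningitis example: P(S|M) = 0.5, P(M) = 1/50 000, P(S) = 1/20
p_m_given_s = posterior(likelihood=0.5, prior=1 / 50_000, evidence=1 / 20)
print(p_m_given_s)  # ≈ 0.0002
```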
The Naive Bayes algorithm
Bayes' Theorem isn't just for calculating probabilities; it can also be used for classification tasks. That's where the Naïve Bayes algorithm comes in.
Comparison with 1R
1R (One Rule) is a simple classifier that makes decisions based on just one attribute (the best-performing one).
In contrast, Naive Bayes:
- Uses all attributes in the data
- Assumes each attribute independently contributes to the final decision
- Treats all attributes as equally important
Assumptions in Naive Bayes
Independence assumption
- It assumes that the attributes (features) are conditionally independent, given the class. In plain terms: knowing the value of one feature tells you nothing about another feature if you know the class
- Example: If class = “spam”, the appearance of the word “free” in an email is assumed to be independent of whether the word “win” appears
For data that violates the independence assumption, we can apply feature selection beforehand to identify and discard correlated (redundant) attributes.
Equal importance assumption
- It assumes that all features contribute equally to the classification decision.
- Example: The word “free” is considered just as important as “hello” when classifying emails, which may not reflect reality.
Why it’s called “Naive”
Because these assumptions are unrealistic. In real data, features are often dependent (e.g., “free” and “win” often appear together in spam), and some features are clearly more important than others. Yet, despite being naive, the algorithm often performs surprisingly well, especially in text-based applications like spam filtering or sentiment analysis.
Example - Weather
Given: the weather data:
| Outlook | Temp | Humidity | Windy | Play |
|---|---|---|---|---|
| Sunny | Hot | High | False | No |
| Sunny | Hot | High | True | No |
| Overcast | Hot | High | False | Yes |
| Rainy | Mild | High | False | Yes |
| Rainy | Cool | Normal | False | Yes |
| Rainy | Cool | Normal | True | No |
| Overcast | Cool | Normal | True | Yes |
| Sunny | Mild | High | False | No |
| Sunny | Cool | Normal | False | Yes |
| Rainy | Mild | Normal | False | Yes |
| Sunny | Mild | Normal | True | Yes |
| Overcast | Mild | High | True | Yes |
| Overcast | Hot | Normal | False | Yes |
| Rainy | Mild | High | True | No |
Task: use Naive Bayes to predict the class (yes or no) of the new example:
outlook=sunny, temperature=cool, humidity=high, windy=true
The theorem
The Bayes Theorem:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

- Identify $E$ and $H$
- The evidence $E$ is the new example
- The hypothesis $H$ is play=yes (and there is another hypothesis: play=no)
How to use Naive Bayes for classification
You are to classify a new example by calculating the posterior probability for each class (i.e., yes or no):
- Identify $E$ and $H$
- Calculate $P(H \mid E)$ for each $H$ (class), i.e. $P(\text{yes} \mid E)$ and $P(\text{no} \mid E)$
- Compare them and assign $E$ to the class with the highest probability (see the sketch below)
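A minimal sketch of this procedure (function and dictionary names are my own; it assumes the priors and per-attribute likelihoods have already been estimated from the training data, and compares only the numerators since $P(E)$ is common to all classes):

```python
from math import prod

def classify(example: dict, priors: dict, likelihoods: dict) -> str:
    """Assign the class with the highest Naive Bayes score.

    priors[c]              = P(c)
    likelihoods[c][(a, v)] = P(attribute a = value v | class c)
    The common denominator P(E) is ignored: it does not change the ranking.
    """
    scores = {
        c: priors[c] * prod(likelihoods[c][(a, v)] for a, v in example.items())
        for c in priors
    }
    return max(scores, key=scores.get)
```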
Calculate $P(H \mid E)$ for each $H$
We need to calculate and compare $P(\text{yes} \mid E)$ and $P(\text{no} \mid E)$:

$$P(\text{yes} \mid E) = \frac{P(E \mid \text{yes})\,P(\text{yes})}{P(E)} \qquad P(\text{no} \mid E) = \frac{P(E \mid \text{no})\,P(\text{no})}{P(E)}$$

Where $E$ = outlook=sunny, temperature=cool, humidity=high, windy=true
Find $P(E \mid H)$
Let's split the evidence into 4 smaller pieces of evidence:
$E_1$ = outlook=sunny, $E_2$ = temperature=cool, $E_3$ = humidity=high, $E_4$ = windy=true
We use the Naive Bayes independence assumption:
Attributes (features) are conditionally independent, given the class.
Therefore, $E_1$, $E_2$, $E_3$ and $E_4$ are independent given the class, and their combined probability is the product of the per-attribute probabilities:

$$P(E \mid H) = P(E_1 \mid H)\,P(E_2 \mid H)\,P(E_3 \mid H)\,P(E_4 \mid H)$$
Comparison
We substitute into the Naive Bayes formula:

$$P(H \mid E) = \frac{P(E_1 \mid H)\,P(E_2 \mid H)\,P(E_3 \mid H)\,P(E_4 \mid H)\,P(H)}{P(E)}$$

Since $P(E)$ is the same for all classes, we can ignore it for the comparison. In other words: we can just compare the numerators; there is no need to calculate $P(E)$.
Calculate the probabilities from the training data
$E_1$ = outlook=sunny, $E_2$ = temperature=cool, $E_3$ = humidity=high, $E_4$ = windy=true
| Outlook | Temp | Humidity | Windy | Play |
|---|---|---|---|---|
| Sunny | Hot | High | False | No |
| Sunny | Hot | High | True | No |
| Overcast | Hot | High | False | Yes |
| Rainy | Mild | High | False | Yes |
| Rainy | Cool | Normal | False | Yes |
| Rainy | Cool | Normal | True | No |
| Overcast | Cool | Normal | True | Yes |
| Sunny | Mild | High | False | No |
| Sunny | Cool | Normal | False | Yes |
| Rainy | Mild | Normal | False | Yes |
| Sunny | Mild | Normal | True | Yes |
| Overcast | Mild | High | True | Yes |
| Overcast | Hot | Normal | False | Yes |
| Rainy | Mild | High | True | No |
Calculate $P(\text{yes} \mid E)$
We are to find:

$$P(\text{yes} \mid E) = \frac{P(E_1 \mid \text{yes})\,P(E_2 \mid \text{yes})\,P(E_3 \mid \text{yes})\,P(E_4 \mid \text{yes})\,P(\text{yes})}{P(E)}$$

Answer (counting from the table, with 9 yes examples out of 14):
- $P(E_1 \mid \text{yes}) = P(\text{outlook}=\text{sunny} \mid \text{yes}) = 2/9$
- $P(E_2 \mid \text{yes}) = P(\text{temperature}=\text{cool} \mid \text{yes}) = 3/9$
- $P(E_3 \mid \text{yes}) = P(\text{humidity}=\text{high} \mid \text{yes}) = 3/9$
- $P(E_4 \mid \text{yes}) = P(\text{windy}=\text{true} \mid \text{yes}) = 3/9$
- $P(\text{yes}) = 9/14$

Now we substitute the probabilities into the formula:

$$P(\text{yes} \mid E) = \frac{\frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{9}{14}}{P(E)} \approx \frac{0.0053}{P(E)}$$
Calculate $P(\text{no} \mid E)$
Similarly, counting the 5 no examples, we obtain:

$$P(\text{no} \mid E) = \frac{\frac{3}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{3}{5} \cdot \frac{5}{14}}{P(E)} \approx \frac{0.0206}{P(E)}$$
Evaluate the comparison
Therefore:

$$\frac{0.0206}{P(E)} > \frac{0.0053}{P(E)} \;\Rightarrow\; P(\text{no} \mid E) > P(\text{yes} \mid E)$$

We can conclude that, for the new day, play = no is more likely than play = yes.
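The whole calculation can be reproduced from the table with a short counting script. This is a sketch of my own (not the lecture's code); the 14 rows are the table above:

```python
from collections import Counter
from math import prod

# Weather data: (outlook, temp, humidity, windy, play)
data = [
    ("sunny", "hot", "high", False, "no"),       ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),   ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),   ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),   ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]
new = ("sunny", "cool", "high", True)  # E1..E4

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, n_c in class_counts.items():
    # P(Ei | c): fraction of class-c rows whose i-th attribute matches the new example
    likelihoods = [
        sum(1 for row in data if row[-1] == c and row[i] == v) / n_c
        for i, v in enumerate(new)
    ]
    scores[c] = prod(likelihoods) * n_c / len(data)  # numerator of P(c | E)

print(scores)                       # no ≈ 0.0206, yes ≈ 0.0053
print(max(scores, key=scores.get))  # 'no'
```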
Probability value of zero
Suppose that the training data was different:
outlook=sunny had always occurred together with play=no (i.e. outlook=sunny had never occurred together with play=yes)
Then:
- $P(\text{outlook}=\text{sunny} \mid \text{yes}) = 0$; and therefore
- $P(\text{yes} \mid E) = 0$ for every example $E$ with outlook=sunny

This results in a final probability of 0 regardless of the other probabilities. This is not good because:
- The other probabilities are completely ignored due to the multiplication by 0
- The prediction for new examples with outlook=sunny will always be no, regardless of the values of the other attributes
Laplace correction (Laplace estimator)
Laplace correction is a technique used in Naive Bayes classification to avoid the problem of zero probabilities. It assumes that our training data is so large that adding 1 to each count makes no noticeable difference to the calculated probabilities, yet it avoids the case of a 0 probability.
What does it do?
Laplace correction adds 1 to the numerator and $k$ to the denominator, where $k$ is the number of attribute values for the given attribute:

$$P(a = v \mid C) = \frac{\text{count}(a = v, C) + 1}{\text{count}(C) + k}$$
Example
Given a dataset with 2000 examples and 2 classes, buy_Mercedes=yes and buy_Mercedes=no, with 1000 examples in each class. One of the attributes is income, with 3 values: low, medium and high.
For class buy_Mercedes=yes, there are 0 examples with income=low, 10 with income=medium and 990 with income=high.
Without Laplace correction

$$P(\text{low} \mid \text{yes}) = \frac{0}{1000} = 0 \qquad P(\text{medium} \mid \text{yes}) = \frac{10}{1000} = 0.01 \qquad P(\text{high} \mid \text{yes}) = \frac{990}{1000} = 0.99$$

The problem: the zero probability for income=low would cause the entire Naive Bayes product to go to zero whenever this value appears.
With Laplace correction
Laplace correction adds 1 to the count of each value and adds $k$ to the denominator, where $k$ = number of attribute values = 3 (low, medium, high):

$$P(\text{low} \mid \text{yes}) = \frac{0+1}{1000+3} \approx 0.001 \qquad P(\text{medium} \mid \text{yes}) = \frac{10+1}{1000+3} \approx 0.011 \qquad P(\text{high} \mid \text{yes}) = \frac{990+1}{1000+3} \approx 0.988$$

The corrected probabilities stay close to the original ones, yet the 0 probability is avoided.
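A small sketch of the corrected estimate in Python (the helper is my own; the counts are the ones from this Mercedes example):

```python
def laplace_prob(value_count: int, class_count: int, k: int) -> float:
    """Laplace-corrected estimate: (count + 1) / (class total + k)."""
    return (value_count + 1) / (class_count + k)

counts = {"low": 0, "medium": 10, "high": 990}  # income counts within buy_Mercedes=yes
k = len(counts)                                  # 3 attribute values
for value, n in counts.items():
    print(value, round(laplace_prob(n, 1000, k), 4))
# low 0.001, medium 0.011, high 0.988
```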
Laplace correction: M-estimate
The M-estimate is a general technique to avoid zero probabilities (like Laplace), but it gives more flexibility and control over the correction. The M-estimate adds $m \cdot p$ to the count of each value and adds $m$ to the denominator, where $p$ is the prior probability of the attribute value and $m$ is a constant called the equivalent sample size:

$$P(a = v \mid C) = \frac{\text{count}(a = v, C) + m \cdot p}{\text{count}(C) + m}$$

Note that Laplace correction is the special case with $p = 1/k$ and $m = k$, where $k$ is the number of attribute values. If you don't have the prior probability or data to estimate it, use the uniform prior: $p = 1/k$.
Mercedes example with the M-estimate
Attribute: income has 3 values (low, medium, high)
- Suppose we estimate the prior probabilities $p_{\text{low}}$, $p_{\text{medium}}$ and $p_{\text{high}}$ of the three income values
Now, for class buy_Mercedes = yes, you use those priors in:

$$P(\text{income}=v \mid \text{yes}) = \frac{\text{count}(\text{income}=v, \text{yes}) + m \cdot p_v}{1000 + m}$$

The value for $m$ will be given in the exam.
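A sketch of the M-estimate in code, with assumed values: the uniform prior $p = 1/3$ and $m = 3$ are my own choices purely for illustration (with those particular choices the result coincides with the Laplace correction):

```python
def m_estimate(value_count: int, class_count: int, m: float, p: float) -> float:
    """M-estimate: (count + m * p) / (class total + m)."""
    return (value_count + m * p) / (class_count + m)

counts = {"low": 0, "medium": 10, "high": 990}  # income counts within buy_Mercedes=yes
m = 3.0      # equivalent sample size (assumed here; given in the exam)
p = 1 / 3    # uniform prior over the 3 income values (assumed)
for value, n in counts.items():
    print(value, round(m_estimate(n, 1000, m, p), 4))
# same numbers as the Laplace correction above, since m = k and p = 1/k
```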
Handling missing values
The way to handle a missing attribute value in Naive Bayes is simply to leave that attribute out of the calculation.
Example
We use the previous weather example, but with a missing value:
outlook=?, temperature=cool, humidity=high, windy=true
We ignore outlook in the calculation:

$$P(\text{yes} \mid E) \propto \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{9}{14} \approx 0.0238 \qquad P(\text{no} \mid E) \propto \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{3}{5} \cdot \frac{5}{14} \approx 0.0343$$

outlook is not included here. As one of the fractions is missing, both values are higher than before, but the comparison is fair: the same fraction is missing in both cases.
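The same counting idea in code, with outlook left out of the product (a sketch of my own; the fractions are the ones listed above):

```python
from math import prod

# Per-attribute probabilities for E2..E4 (temperature=cool, humidity=high, windy=true);
# outlook is missing, so it is simply not included in the product.
yes_factors = [3 / 9, 3 / 9, 3 / 9]  # P(Ei | yes)
no_factors = [1 / 5, 4 / 5, 3 / 5]   # P(Ei | no)

score_yes = prod(yes_factors) * 9 / 14  # ≈ 0.0238
score_no = prod(no_factors) * 5 / 14    # ≈ 0.0343
print("yes:", round(score_yes, 4), "no:", round(score_no, 4))  # no still wins
```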
Back to parent page: Supervised Machine Learning
AI Machine_Learning COMP3308 Supervised_Learning Eager_Learning Classification Naïve_Bayes Categorical_Attributes Laplace_Correction M-estimate #Missing_Value