The Naïve Bayes algorithm is a simple yet powerful probabilistic classifier based on Bayes’ Theorem. It belongs to the family of Bayesian classifiers, which use probability theory to predict the class of a given input based on prior knowledge and observed data.
Definition
- Prior probability: The initial probability of an event (A) before seeing any new evidence.
- Posterior probability: The probability of the hypothesis (A) after seeing the evidence (B). In simpler words, the updated probability of an event after taking new evidence (B) into account.
Bayes Theorem
Given a hypothesis $H$ (class) and evidence $E$ (data) for this hypothesis, the probability of $H$ given $E$ is:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

Where:
- $P(H \mid E)$ = Posterior probability (what we want: probability of a class given the data)
- $P(E \mid H)$ = Likelihood (how likely this data is if we assume the class)
- $P(H)$ = Prior probability (how likely the class is in general)
- $P(E)$ = Evidence (overall probability of seeing this data)
Example - Medical diagnosis
Given: A doctor knows that:
- Meningitis causes stiff neck 50% of the time
- Prior probability of any patient having meningitis is 1/50 000
- Prior probability of any patient having stiff neck is 1/20
Question: If a patient has a stiff neck, what is the probability that he has meningitis?
Answer
We annotate the question:
Given: A doctor knows that:
- Meningitis causes stiff neck 50% of the time ($P(S \mid M) = 0.5$)
- Prior probability of any patient having meningitis is 1/50 000 ($P(M) = 1/50\,000$)
- Prior probability of any patient having stiff neck is 1/20 ($P(S) = 1/20$)
Question: If a patient has a stiff neck, what is the probability that he has meningitis? ($P(M \mid S) = \,?$)
We use Bayes' Theorem to solve the question:

$$P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)} = \frac{0.5 \times 1/50\,000}{1/20} = 0.0002$$

The patient has a 0.0002 probability of having meningitis given that he has a stiff neck.
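As a quick sanity check of this arithmetic, here is a minimal Python sketch (the helper function and variable names are my own, not from the lecture notes):

```python
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Bayes' Theorem: P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence

# Meningitis example: P(S|M) = 0.5, P(M) = 1/50 000, P(S) = 1/20
p_m_given_s = posterior(likelihood=0.5, prior=1 / 50_000, evidence=1 / 20)
print(p_m_given_s)  # ≈ 0.0002
```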
The Naive Bayes algorithm
Bayes' Theorem isn't just for calculating probabilities; it can also be used for classification tasks. That's where the Naïve Bayes algorithm comes in.
Comparison with 1R
1R (One Rule) is a simple classifier that makes decisions based on just one attribute (the best-performing one).
In contrast, Naive Bayes:
- Uses all attributes in the data
- Assumes each attribute independently contributes to the final decision
- Treats all attributes as equally important
Assumptions in Naive Bayes
Independence assumption
- It assumes that the attributes (features) are conditionally independent, given the class. In plain terms: knowing the value of one feature tells you nothing about another feature if you know the class
- Example: If class = “spam”, the appearance of the word “free” in an email is assumed to be independent of whether the word “win” appears
For data that violates the independence assumption, we can apply feature selection beforehand to identify and discard correlated (redundant) attributes.
Equal importance assumption
- It assumes that all features contribute equally to the classification decision.
- Example: The word “free” is considered just as important as “hello” when classifying emails, which may not reflect reality.
Why it’s called “Naive”
Because these assumptions are unrealistic. In real data, features are often dependent (e.g., “free” and “win” often appear together in spam), and some features are clearly more important than others. Yet, despite being naive, the algorithm often performs surprisingly well, especially in text-based applications like spam filtering or sentiment analysis.
Example - Weather
Given: the weather data:
| Outlook | Temp | Humidity | Windy | Play |
|---|---|---|---|---|
| Sunny | Hot | High | False | No |
| Sunny | Hot | High | True | No |
| Overcast | Hot | High | False | Yes |
| Rainy | Mild | High | False | Yes |
| Rainy | Cool | Normal | False | Yes |
| Rainy | Cool | Normal | True | No |
| Overcast | Cool | Normal | True | Yes |
| Sunny | Mild | High | False | No |
| Sunny | Cool | Normal | False | Yes |
| Rainy | Mild | Normal | False | Yes |
| Sunny | Mild | Normal | True | Yes |
| Overcast | Mild | High | True | Yes |
| Overcast | Hot | Normal | False | Yes |
| Rainy | Mild | High | True | No |
Task: use Naive Bayes to predict the class (yes or no) of the new example:
outlook=sunny, temperature=cool, humidity=high, windy=true
The theorem
The Bayes Theorem:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

- Identify $E$ and $H$
- The evidence $E$ is the new example
- The hypothesis $H$ is play=yes (and there is another hypothesis: play=no)
How to use Naive Bayes for classification
You are to classify a new example by calculating the posterior probability for each class (i.e., yes or no):
- Identify $E$ and $H$
- Calculate $P(H \mid E)$ for each $H$ (class), i.e. $P(\text{yes} \mid E)$ and $P(\text{no} \mid E)$
- Compare them and assign $E$ to the class with the highest probability (see the sketch below)
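A minimal sketch of this procedure (function and dictionary names are my own; it assumes the priors and per-attribute likelihoods have already been estimated from the training data, and compares only the numerators since $P(E)$ is common to all classes):

```python
from math import prod

def classify(example: dict, priors: dict, likelihoods: dict) -> str:
    """Assign the class with the highest Naive Bayes score.

    priors[c]              = P(c)
    likelihoods[c][(a, v)] = P(attribute a = value v | class c)
    The common denominator P(E) is ignored: it does not change the ranking.
    """
    scores = {
        c: priors[c] * prod(likelihoods[c][(a, v)] for a, v in example.items())
        for c in priors
    }
    return max(scores, key=scores.get)
```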
Calculate $P(H \mid E)$ for each $H$
We need to calculate and compare $P(\text{yes} \mid E)$ and $P(\text{no} \mid E)$:

$$P(\text{yes} \mid E) = \frac{P(E \mid \text{yes})\,P(\text{yes})}{P(E)} \qquad P(\text{no} \mid E) = \frac{P(E \mid \text{no})\,P(\text{no})}{P(E)}$$

Where $E$ = outlook=sunny, temperature=cool, humidity=high, windy=true
Find $P(E \mid H)$
Let's split the evidence into 4 smaller pieces of evidence:
$E_1$ = outlook=sunny, $E_2$ = temperature=cool, $E_3$ = humidity=high, $E_4$ = windy=true
We use the Naive Bayes independence assumption:
Attributes (features) are conditionally independent, given the class.
Therefore, $E_1$, $E_2$, $E_3$ and $E_4$ are independent given the class, and their combined probability is the product of the per-attribute probabilities:

$$P(E \mid H) = P(E_1 \mid H)\,P(E_2 \mid H)\,P(E_3 \mid H)\,P(E_4 \mid H)$$
Comparison
We substitute into the Naive Bayes formula:

$$P(H \mid E) = \frac{P(E_1 \mid H)\,P(E_2 \mid H)\,P(E_3 \mid H)\,P(E_4 \mid H)\,P(H)}{P(E)}$$

Since $P(E)$ is the same for all classes, we can ignore it for the comparison. In other words: we can just compare the numerators; there is no need to calculate $P(E)$.
Calculate the probabilities from the training data
$E_1$ = outlook=sunny, $E_2$ = temperature=cool, $E_3$ = humidity=high, $E_4$ = windy=true
| Outlook | Temp | Humidity | Windy | Play |
|---|---|---|---|---|
| Sunny | Hot | High | False | No |
| Sunny | Hot | High | True | No |
| Overcast | Hot | High | False | Yes |
| Rainy | Mild | High | False | Yes |
| Rainy | Cool | Normal | False | Yes |
| Rainy | Cool | Normal | True | No |
| Overcast | Cool | Normal | True | Yes |
| Sunny | Mild | High | False | No |
| Sunny | Cool | Normal | False | Yes |
| Rainy | Mild | Normal | False | Yes |
| Sunny | Mild | Normal | True | Yes |
| Overcast | Mild | High | True | Yes |
| Overcast | Hot | Normal | False | Yes |
| Rainy | Mild | High | True | No |
Calculate $P(\text{yes} \mid E)$
We are to find:

$$P(\text{yes} \mid E) = \frac{P(E_1 \mid \text{yes})\,P(E_2 \mid \text{yes})\,P(E_3 \mid \text{yes})\,P(E_4 \mid \text{yes})\,P(\text{yes})}{P(E)}$$

Answer (counting from the table, with 9 yes examples out of 14):
- $P(E_1 \mid \text{yes}) = P(\text{outlook}=\text{sunny} \mid \text{yes}) = 2/9$
- $P(E_2 \mid \text{yes}) = P(\text{temperature}=\text{cool} \mid \text{yes}) = 3/9$
- $P(E_3 \mid \text{yes}) = P(\text{humidity}=\text{high} \mid \text{yes}) = 3/9$
- $P(E_4 \mid \text{yes}) = P(\text{windy}=\text{true} \mid \text{yes}) = 3/9$
- $P(\text{yes}) = 9/14$

Now we substitute the probabilities into the formula:

$$P(\text{yes} \mid E) = \frac{\frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{9}{14}}{P(E)} \approx \frac{0.0053}{P(E)}$$
Calculate $P(\text{no} \mid E)$
Similarly, counting the 5 no examples, we obtain:

$$P(\text{no} \mid E) = \frac{\frac{3}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{3}{5} \cdot \frac{5}{14}}{P(E)} \approx \frac{0.0206}{P(E)}$$
Evaluate the comparison
Therefore:

$$\frac{0.0206}{P(E)} > \frac{0.0053}{P(E)} \;\Rightarrow\; P(\text{no} \mid E) > P(\text{yes} \mid E)$$

We can conclude that, for the new day, play = no is more likely than play = yes.
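The whole calculation can be reproduced from the table with a short counting script. This is a sketch of my own (not the lecture's code); the 14 rows are the table above:

```python
from collections import Counter
from math import prod

# Weather data: (outlook, temp, humidity, windy, play)
data = [
    ("sunny", "hot", "high", False, "no"),       ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),   ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),   ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),   ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]
new = ("sunny", "cool", "high", True)  # E1..E4

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, n_c in class_counts.items():
    # P(Ei | c): fraction of class-c rows whose i-th attribute matches the new example
    likelihoods = [
        sum(1 for row in data if row[-1] == c and row[i] == v) / n_c
        for i, v in enumerate(new)
    ]
    scores[c] = prod(likelihoods) * n_c / len(data)  # numerator of P(c | E)

print(scores)                       # no ≈ 0.0206, yes ≈ 0.0053
print(max(scores, key=scores.get))  # 'no'
```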
Probability value of zero
Suppose that the training data was different:
outlook=sunny had always occurred together with play=no (i.e. outlook=sunny had never occurred together with play=yes)
Then:
- $P(\text{outlook}=\text{sunny} \mid \text{yes}) = 0$; and therefore
- $P(\text{yes} \mid E) = 0$ for every example $E$ with outlook=sunny

This results in a final probability of 0 regardless of the other probabilities. This is not good because:
- The other probabilities are completely ignored due to the multiplication by 0
- The prediction for new examples with outlook=sunny will always be no, regardless of the values of the other attributes
Laplace correction (Laplace estimator)
Laplace correction is a technique used in Naive Bayes classification to avoid the problem of zero probabilities. It assumes that our training data is so large that adding 1 to each count makes no noticeable difference to the calculated probabilities, yet it avoids the case of a 0 probability.
What does it do?
Laplace correction adds 1 to the numerator and $k$ to the denominator, where $k$ is the number of attribute values for the given attribute:

$$P(a = v \mid C) = \frac{\text{count}(a = v, C) + 1}{\text{count}(C) + k}$$
Example
Given a dataset with 2000 examples and 2 classes, buy_Mercedes=yes and buy_Mercedes=no, with 1000 examples in each class. One of the attributes is income, with 3 values: low, medium and high.
For class buy_Mercedes=yes, there are 0 examples with income=low, 10 with income=medium and 990 with income=high.
Without Laplace correction

$$P(\text{low} \mid \text{yes}) = \frac{0}{1000} = 0 \qquad P(\text{medium} \mid \text{yes}) = \frac{10}{1000} = 0.01 \qquad P(\text{high} \mid \text{yes}) = \frac{990}{1000} = 0.99$$

The problem: the zero probability for income=low would cause the entire Naive Bayes product to go to zero whenever this value appears.
With Laplace correction
Laplace correction adds 1 to the count of each value and adds $k$ to the denominator, where $k$ = number of attribute values = 3 (low, medium, high):

$$P(\text{low} \mid \text{yes}) = \frac{0+1}{1000+3} \approx 0.001 \qquad P(\text{medium} \mid \text{yes}) = \frac{10+1}{1000+3} \approx 0.011 \qquad P(\text{high} \mid \text{yes}) = \frac{990+1}{1000+3} \approx 0.988$$

The corrected probabilities stay close to the original ones, yet the 0 probability is avoided.
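A small sketch of the corrected estimate in Python (the helper is my own; the counts are the ones from this Mercedes example):

```python
def laplace_prob(value_count: int, class_count: int, k: int) -> float:
    """Laplace-corrected estimate: (count + 1) / (class total + k)."""
    return (value_count + 1) / (class_count + k)

counts = {"low": 0, "medium": 10, "high": 990}  # income counts within buy_Mercedes=yes
k = len(counts)                                  # 3 attribute values
for value, n in counts.items():
    print(value, round(laplace_prob(n, 1000, k), 4))
# low 0.001, medium 0.011, high 0.988
```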
Laplace correction: M-estimate
The M-estimate is a general technique to avoid zero probabilities (like Laplace), but it gives more flexibility and control over the correction. The M-estimate adds $m \cdot p$ to the count of each value and adds $m$ to the denominator, where $p$ is the prior probability of the attribute value and $m$ is a constant called the equivalent sample size:

$$P(a = v \mid C) = \frac{\text{count}(a = v, C) + m \cdot p}{\text{count}(C) + m}$$

Note that Laplace correction is the special case with $p = 1/k$ and $m = k$, where $k$ is the number of attribute values. If you don't have the prior probability or data to estimate it, use the uniform prior: $p = 1/k$.
Mercedes example with the M-estimate
Attribute: income has 3 values (low, medium, high)
- Suppose we estimate the prior probabilities $p_{\text{low}}$, $p_{\text{medium}}$ and $p_{\text{high}}$ of the three income values
Now, for class buy_Mercedes = yes, you use those priors in:

$$P(\text{income}=v \mid \text{yes}) = \frac{\text{count}(\text{income}=v, \text{yes}) + m \cdot p_v}{1000 + m}$$

The value for $m$ will be given in the exam.
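A sketch of the M-estimate in code, with assumed values: the uniform prior $p = 1/3$ and $m = 3$ are my own choices purely for illustration (with those particular choices the result coincides with the Laplace correction):

```python
def m_estimate(value_count: int, class_count: int, m: float, p: float) -> float:
    """M-estimate: (count + m * p) / (class total + m)."""
    return (value_count + m * p) / (class_count + m)

counts = {"low": 0, "medium": 10, "high": 990}  # income counts within buy_Mercedes=yes
m = 3.0      # equivalent sample size (assumed here; given in the exam)
p = 1 / 3    # uniform prior over the 3 income values (assumed)
for value, n in counts.items():
    print(value, round(m_estimate(n, 1000, m, p), 4))
# same numbers as the Laplace correction above, since m = k and p = 1/k
```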
Handling missing values
The way to handle a missing attribute value in Naive Bayes is simply to leave that attribute out of the calculation.
Example
We use the previous weather example, but with a missing value:
outlook=?, temperature=cool, humidity=high, windy=true
We ignore outlook in the calculation:

$$P(\text{yes} \mid E) \propto \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{9}{14} \approx 0.0238 \qquad P(\text{no} \mid E) \propto \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{3}{5} \cdot \frac{5}{14} \approx 0.0343$$

outlook is not included here. As one of the fractions is missing, both values are higher than before, but the comparison is fair: the same fraction is missing in both cases.
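The same counting idea in code, with outlook left out of the product (a sketch of my own; the fractions are the ones listed above):

```python
from math import prod

# Per-attribute probabilities for E2..E4 (temperature=cool, humidity=high, windy=true);
# outlook is missing, so it is simply not included in the product.
yes_factors = [3 / 9, 3 / 9, 3 / 9]  # P(Ei | yes)
no_factors = [1 / 5, 4 / 5, 3 / 5]   # P(Ei | no)

score_yes = prod(yes_factors) * 9 / 14  # ≈ 0.0238
score_no = prod(no_factors) * 5 / 14    # ≈ 0.0343
print("yes:", round(score_yes, 4), "no:", round(score_no, 4))  # no still wins
```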
Back to parent page: Supervised Machine Learning
AI Machine_Learning COMP3308 Supervised_Learning Eager_Learning Classification Naïve_Bayes Categorical_Attributes Laplace_Correction M-estimate #Missing_Value