Why the Naive Bayes Method Is Naive

2023-09-10

Bayes' theorem is a fundamental building block of probability theory. In simple terms, it lets us express and update our beliefs given new information.

The formula:

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
  • P(A|B) is the probability of event A given that event B is true.
  • P(B|A) is the probability of event B given that event A is true.
  • P(A) and P(B) are the probabilities of events A and B respectively.
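To make the formula concrete, here is a tiny numeric check with made-up numbers (the spam scenario and all probabilities below are hypothetical, just for illustration):

```python
# A = "email is spam", B = "email contains the word 'free'".
p_a = 0.2            # P(A): prior probability that an email is spam
p_b_given_a = 0.6    # P(B|A): 'free' appears in 60% of spam emails
p_b = 0.25           # P(B): 'free' appears in 25% of all emails

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.48
```

Seeing the word "free" raised our belief that the email is spam from 20% to 48%.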

One interesting application of this theorem is sentiment analysis. That specific use case of Bayes' theorem is called the Naive Bayes method. Let's find out why.

For this problem, I need a dataset: for example, the IMDB 50K dataset I mentioned in my previous post, or you can create your own from text samples with sentiment labels such as "positive," "negative," or "neutral".

The algorithm:

  1. Calculate the prior probabilities based on the distribution of sentiments (P(positive), P(negative), P(neutral)).
  2. Then, for each word, calculate the sentiment-conditional probability based on its occurrence in the text snippets (P(love|positive), P(terrible|negative), etc.).
  3. Based on that information, we can now define a posterior classifier that updates the sentiment probability:
    P(positive|text) = \frac{P(love|positive) \cdot P(terrible|positive) \cdot \ldots \cdot P(positive)}{P(text)}
    Then the same for neutral and negative.
  4. To calculate P(text), use the law of total probability:
P(text) = P(text|positive) \cdot P(positive) + P(text|negative) \cdot P(negative) + P(text|neutral) \cdot P(neutral)

To calculate P(text|positive), P(text|negative), and P(text|neutral), we use a simplification called the bag of words, where we assume that all words in the sentence are independent and their only feature is frequency:

P(text|positive) = P(love|positive) \cdot P(this|positive) \cdot P(weather|positive)

And the same for the rest of the sentiment labels.
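Putting all four steps together, the whole classifier fits in a short self-contained sketch. The toy dataset is made up, and I add Laplace (add-one) smoothing, a standard trick not covered above, so that a single unseen word doesn't zero out the whole product:

```python
from collections import Counter

# Hypothetical toy dataset, for illustration only.
samples = [
    ("i love this movie", "positive"),
    ("terrible movie", "negative"),
    ("it was a movie", "neutral"),
]

sentiments = {label for _, label in samples}
# Step 1: priors from the label distribution.
priors = {s: sum(l == s for _, l in samples) / len(samples) for s in sentiments}
# Step 2: word counts per sentiment.
words = {s: Counter() for s in sentiments}
for text, label in samples:
    words[label].update(text.split())
vocab = {w for counter in words.values() for w in counter}

def p_text_given(text, s, alpha=1):
    # Bag-of-words assumption: multiply per-word likelihoods,
    # with add-one (Laplace) smoothing for unseen words.
    total = sum(words[s].values()) + alpha * len(vocab)
    p = 1.0
    for w in text.split():
        p *= (words[s][w] + alpha) / total
    return p

def classify(text):
    # Step 4: P(text) via the law of total probability.
    p_text = sum(p_text_given(text, s) * priors[s] for s in sentiments)
    # Step 3: posterior P(sentiment|text) for each label.
    return {s: p_text_given(text, s) * priors[s] / p_text for s in sentiments}

posterior = classify("love this movie")
print(max(posterior, key=posterior.get))  # positive
```

Real implementations usually sum log-probabilities instead of multiplying, since products of many small numbers underflow quickly.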

The bag of words simplification is exactly why this method is called "naive." It might seem like a shallow assumption, but it turns out to be surprisingly effective in practice.

Subscribe for daily updates on software development, productivity, and more.