[algorithm] A simple explanation of Naive Bayes Classification

I'll try to explain Bayes' rule with an example.

What is the chance that a person selected at random from society is a smoker?

You may reply 10%, and let's assume that's right.

Now, what if I say that the random person is a man and is 15 years old?

You may say 15 or 20%, but why?

In fact, we are updating our initial guess with new pieces of evidence (P(smoker) vs. P(smoker | evidence)). Bayes' rule is a way to relate these two probabilities.

P(smoker | evidence) = P(smoker) * P(evidence | smoker) / P(evidence)

Each piece of evidence may increase or decrease this chance. For example, the fact that he is a man may increase the chance, provided that the percentage of men among non-smokers is lower than among smokers.

In other words, being a man must be an indicator of being a smoker rather than a non-smoker. Therefore, if a piece of evidence is an indicator of something, it increases the chance of that thing.

But how do we know that this is an indicator?

For each feature, you can compare the commonness (probability) of that feature under the given condition with its commonness alone (P(f | x) vs. P(f)).

P(smoker | evidence) / P(smoker) = P(evidence | smoker) / P(evidence)

For example, if we know that 90% of smokers are men, that is still not enough to say whether being a man is an indicator of being a smoker. If the probability of being a man in society is also 90%, then knowing that someone is a man doesn't help us: (90% / 90%) = 1. But if men make up 40% of society and 90% of smokers, then knowing that someone is a man increases the chance of being a smoker: (90% / 40%) = 2.25, so it multiplies the initial guess (10%) by 2.25, resulting in 22.5%.

However, if the probability of being a man were 95% in society, then even though the percentage of men among smokers is high (90%), the evidence that someone is a man decreases the chance of him being a smoker: (90% / 95%) ≈ 0.95.
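
The three cases above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `updated_probability` is made up for illustration, and the numbers are the ones used in the text.

```python
def updated_probability(prior, p_evidence_given_class, p_evidence):
    """Bayes' rule: P(class | evidence) = P(class) * P(evidence | class) / P(evidence)."""
    return prior * p_evidence_given_class / p_evidence


prior_smoker = 0.10        # P(smoker), the initial guess
p_man_given_smoker = 0.90  # P(man | smoker)

# Try the three values of P(man) discussed above: 90%, 40% and 95%.
for p_man in (0.90, 0.40, 0.95):
    ratio = p_man_given_smoker / p_man  # above 1 raises the guess, below 1 lowers it
    posterior = updated_probability(prior_smoker, p_man_given_smoker, p_man)
    print(f"P(man)={p_man:.2f}  ratio={ratio:.3f}  P(smoker | man)={posterior:.4f}")
```

A ratio of exactly 1 leaves the guess unchanged, which is the 90% / 90% case.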

So we have:

P(smoker | f1, f2, f3, ...) = P(smoker) * contribution of f1 * contribution of f2 * ...

= P(smoker) *
(P(being a man | smoker) / P(being a man)) *
(P(under 20 | smoker) / P(under 20))

Note that in this formula we assumed that being a man and being under 20 are independent features, which is why we could multiply their contributions. It means that knowing that someone is under 20 has no effect on guessing whether he is a man or a woman. But this may not be true; for example, maybe most adolescents in a society are men...
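
As a rough sketch, the product of contributions can be written like this. The helper name `posterior_from_contributions` is invented for illustration, and the two "under 20" numbers are hypothetical values picked only to make the example run; only the "man" numbers come from the text.

```python
from math import prod


def posterior_from_contributions(prior, contributions):
    """Multiply the prior by each feature's ratio P(f | class) / P(f).

    `contributions` is a list of (P(f | class), P(f)) pairs. Multiplying the
    ratios is only justified under the independence assumption described above.
    """
    return prior * prod(p_given_class / p_alone for p_given_class, p_alone in contributions)


p_smoker = posterior_from_contributions(
    0.10,  # P(smoker)
    [
        (0.90, 0.40),  # being a man: P(man | smoker), P(man), taken from the text
        (0.05, 0.20),  # being under 20: HYPOTHETICAL values, not from the text
    ],
)
print(p_smoker)
```

With these made-up numbers, being a man raises the guess (ratio 2.25) while being under 20 lowers it (ratio 0.25).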

To use this formula in a classifier

The classifier is given some features (being a man and being under 20) and it must decide whether the person is a smoker or not (these are the two classes). It uses the above formula to calculate the probability of each class given the evidence (features), and it assigns the class with the highest probability to the input. To obtain the required probabilities (90%, 10%, 80%...) it uses the training set. For example, it counts the people in the training set who are smokers and finds that they make up 10% of the sample. Then, for the smokers, it checks how many of them are men or women, how many are above 20 or under 20, and so on. In other words, it tries to build the probability distribution of the features for each class based on the training data.
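
A minimal sketch of such a classifier follows, assuming a tiny made-up training set of ten people. The names `train` and `classify` and all the data are hypothetical. The sketch skips the division by P(evidence), because that value is the same for both classes and so does not change which class wins, and it does no smoothing, so a feature value never seen in a class gives that class a score of zero.

```python
from collections import Counter, defaultdict


def train(samples):
    """Count how often each class and each (feature, value) pair per class occurs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)  # label -> Counter of (feature, value)
    for features, label in samples:
        class_counts[label] += 1
        for name, value in features.items():
            feature_counts[label][(name, value)] += 1
    return class_counts, feature_counts


def classify(features, class_counts, feature_counts):
    """Return the class with the highest P(class) * product of P(feature | class)."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # P(class), estimated from the training set
        for name, value in features.items():
            score *= feature_counts[label][(name, value)] / n  # P(feature | class)
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical training set: (features, class)
samples = [
    ({"sex": "man", "under_20": False}, "smoker"),
    ({"sex": "man", "under_20": False}, "smoker"),
    ({"sex": "woman", "under_20": False}, "smoker"),
    ({"sex": "man", "under_20": True}, "non-smoker"),
    ({"sex": "woman", "under_20": True}, "non-smoker"),
    ({"sex": "woman", "under_20": False}, "non-smoker"),
    ({"sex": "woman", "under_20": False}, "non-smoker"),
    ({"sex": "man", "under_20": False}, "non-smoker"),
    ({"sex": "woman", "under_20": True}, "non-smoker"),
    ({"sex": "woman", "under_20": False}, "non-smoker"),
]

class_counts, feature_counts = train(samples)
label, scores = classify({"sex": "man", "under_20": False}, class_counts, feature_counts)
print(label, scores)
```

The scores are not normalized, so they do not sum to 1; dividing each by their sum would turn them into probabilities.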
