Naïve Bayes
Definition:
- Naïve Bayes overview:
- Relationships between input features and class expressed as probabilities.
- Label for sample is class with highest probability given input.
- Naïve Bayes classifier:
- Classification using probability
- Bayes' theorem: used because it makes estimating the required probabilities easier.
- Feature independence assumption: For a given class, the value of one feature does not affect the value of any other feature.
- The naïve independence assumption and the use of Bayes' theorem give this classification model its name.
- Probability of event:
- Probability is a measure of how likely an event is.
- Probability of event A occurring: P(A) = (number of outcomes in which A occurs) / (total number of possible outcomes).
- Joint probability:
- Probability of events A and B occurring together:
- If the two events are independent: P(A, B) = P(A) * P(B)
- Conditional probability:
- Probability of event A occurring, given that event B occurred.
- Event A is conditioned on event B: P(A|B) = P(A, B) / P(B)
- It provides the means to specify the probability of a class label given the input values.
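As a quick illustration, here is a minimal sketch (with made-up boolean outcomes) that estimates these quantities from counts:

```python
# Each pair is one observation: (did A occur?, did B occur?)
# The data is invented purely for illustration.
outcomes = [
    (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, True), (False, False),
]
n = len(outcomes)

p_a = sum(a for a, b in outcomes) / n          # P(A)
p_b = sum(b for a, b in outcomes) / n          # P(B)
p_ab = sum(a and b for a, b in outcomes) / n   # joint probability P(A, B)
p_a_given_b = p_ab / p_b                       # P(A|B) = P(A, B) / P(B)

print(p_a, p_b, p_ab, p_a_given_b)             # 0.5 0.5 0.375 0.75
# A and B are not independent here: P(A, B) = 0.375 != P(A) * P(B) = 0.25
```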
- Probability of events A and B occurring together can be written two ways: P(A, B) = P(A|B) * P(B) = P(B|A) * P(A)
- Bayes’ theorem:
- The relationship between P(B|A) and P(A|B) can be expressed through Bayes' theorem:
P(A|B) = P(B|A) * P(A) / P(B)
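- For example (hypothetical numbers): if P(B|A) = 0.9, P(A) = 0.2, and P(B) = 0.3, then P(A|B) = 0.9 * 0.2 / 0.3 = 0.6.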
- Classification with probabilities:
Given features X = {X1, X2, ..., Xn}, predict class C. Do this by finding the value of C that maximizes P(C|X).
- Bayes theorem for classification:
- Estimating P(C|X) directly is difficult, so Bayes' theorem is used to simplify the problem:
P(C|X) = P(X|C) * P(C) / P(X)
- Since P(X) is the same for every class, it can be ignored when comparing classes; to find the most probable class, only P(X|C) and P(C) are needed.
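- For example (hypothetical numbers): suppose P(spam) = 0.3, P(ham) = 0.7, and for a message X, P(X|spam) = 0.05 and P(X|ham) = 0.001. Then P(X|spam) * P(spam) = 0.015 and P(X|ham) * P(ham) = 0.0007, so X is labeled spam; P(X) is never needed because it scales both scores equally.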
- Estimating P(C): calculate the fraction of training samples that belong to class C.
- Estimating P(X|C):
- Independence assumption: Features are independent of one another:
P(X1, X2, ..., Xn | C) = P(X1|C) * P(X2|C) * ... * P(Xn|C)
- To estimate P(X|C), only each P(Xi|C) needs to be estimated individually (see the sketch below).
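The following is a minimal sketch of these counting-based estimates for categorical features; the weather-style data, feature values, and labels are invented for illustration:

```python
from collections import Counter, defaultdict

# Each training sample: (tuple of feature values, class label)
train = [
    (("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"), (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"), (("rainy", "hot"), "no"),
]

# Estimate P(C): fraction of training samples in each class.
class_counts = Counter(label for _, label in train)
n = len(train)
prior = {c: class_counts[c] / n for c in class_counts}

# Estimate P(Xi|C): fraction of class-C samples with value xi for feature i.
feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
for features, label in train:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

def likelihood(i, value, c):
    # Note: unseen feature values get probability 0 here; practical
    # implementations typically add Laplace smoothing to avoid this.
    return feature_counts[(i, c)][value] / class_counts[c]

def predict(features):
    # Score each class by P(C) * product of P(Xi|C); P(X) is omitted
    # because it is the same for every class.
    scores = {}
    for c in prior:
        score = prior[c]
        for i, value in enumerate(features):
            score *= likelihood(i, value, c)
        scores[c] = score
    return max(scores, key=scores.get)

print(predict(("rainy", "mild")))  # -> "yes" on this toy data
```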
Advantages and Disadvantages:
a. Advantages:
- Naïve Bayes classification:
- Fast and simple: the probabilities that are needed can be calculated with a single scan of the data set and stored in a table.
- Scales well:
- Both model building and testing scale well.
- Due to the independence assumption:
- The probability for each feature can be independently estimated.
- Each feature's probability can be calculated at very low cost.
- The data set size does not have to grow exponentially with the number of features.
- This avoids many of the problems associated with the curse of dimensionality.
- Not much data is needed to build the model.
- Number of parameters scales linearly with the number of features.
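- To make the scaling concrete (illustrative numbers): with m classes and n categorical features of k values each, the model stores m priors plus about m * n * k conditional probabilities, which is linear in n; a full joint model of P(X1, ..., Xn | C) would need on the order of m * k^n entries (for 2 classes and 20 binary features, 82 values versus roughly 2 million).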
b. Disadvantages:
- The independence assumption may not hold true.
- In practice, it still works quite well.
- Does not model interactions between features.