Probability Theory – Naive Bayes

Bayes’ theorem has been described as being to the theory of probability what the Pythagorean theorem is to geometry[1]. Simply put, Bayes’ theorem describes the probability of a hypothesis based on evidence that may be related to that hypothesis. For example, suppose you want to determine the probability that a person has cancer given that he smokes. Bayes’ theorem gives this probability from the probability that someone smokes given that he has cancer, together with the overall probability of having cancer and the overall probability that someone smokes (regardless of whether he has cancer). Mathematically, Bayes’ theorem is expressed as:

    \[ P(H|e)=\frac{P(e|H)P(H)}{P(e)} \]

• P(H|e) is the probability that our hypothesis is true given the evidence (Posterior).
• P(e|H) is the probability of the evidence given that our hypothesis is true (Likelihood).
• P(H) is the probability of our hypothesis before the evidence (Prior).
• P(e) is the probability of the evidence (Marginal): P(e) = ∑_i P(e|H_i)P(H_i), summed over all competing hypotheses H_i.

Therefore, our example could be expressed as:

    \[ P(Cancer|Smokes)=\frac{P(Smokes|Cancer)P(Cancer)}{P(Smokes)} \]

• P(Cancer|Smokes) is the probability that a person has cancer given that he smokes.
• P(Smokes|Cancer) is the probability that someone smokes given that he has cancer.
• P(Cancer) is the probability of someone having cancer.
• P(Smokes) is the probability that someone smokes.
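To make the formula concrete, the sketch below plugs numbers into it. The three input probabilities are hypothetical values chosen only for illustration (they are not real statistics), and the function name `posterior` is an assumption of this example. P(Smokes) is computed from the marginal formula, summing over the two hypotheses "cancer" and "no cancer".

```python
def posterior(prior_h, likelihood_h, likelihood_not_h):
    """Return P(H|e) from P(H), P(e|H) and P(e|not H) using Bayes' theorem."""
    # Marginal: P(e) = P(e|H)P(H) + P(e|not H)P(not H)
    marginal = likelihood_h * prior_h + likelihood_not_h * (1 - prior_h)
    return likelihood_h * prior_h / marginal

# Hypothetical inputs (illustrative only, not real statistics)
p_cancer = 0.01                 # P(Cancer)
p_smokes_given_cancer = 0.25    # P(Smokes|Cancer)
p_smokes_given_healthy = 0.15   # P(Smokes|no Cancer)

p_cancer_given_smokes = posterior(p_cancer, p_smokes_given_cancer, p_smokes_given_healthy)
print("P(Cancer|Smokes) = %.4f" % p_cancer_given_smokes)
```

With these made-up inputs the posterior comes out at roughly 0.017, which is higher than the prior of 0.01 because smoking is assumed to be more common among people with cancer.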

In practice, only the numerator of the equation is usually of interest when comparing hypotheses: the denominator does not depend on the hypothesis, so for given evidence it is effectively a constant.
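A small sketch of why the denominator can be ignored when ranking hypotheses. The priors and likelihoods are hypothetical numbers for three competing hypotheses; dividing every numerator by the same P(e) rescales them but does not change which one is largest.

```python
import numpy as np

# Hypothetical priors and likelihoods for three competing hypotheses
priors = np.array([0.5, 0.3, 0.2])        # P(H_i)
likelihoods = np.array([0.1, 0.4, 0.3])   # P(e|H_i)

numerators = likelihoods * priors            # P(e|H_i)P(H_i)
posteriors = numerators / numerators.sum()   # divide by P(e), the same constant for every H_i

# The most probable hypothesis is the same with or without the denominator
best_unnormalized = int(np.argmax(numerators))
best_normalized = int(np.argmax(posteriors))
print(best_unnormalized, best_normalized)
```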

Bayes’ theorem can be extended to probability distributions (to determine confidence intervals and other statistical tests). For example, a survey of 20 people with cancer found that 5 of them smoked. To be safe, we start from a Prior that favors no particular answer: every possible proportion of smokers among people with cancer, from 0 to 1, is treated as equally likely. This can be represented as a uniform distribution:

#%% Python Learn Function
# Import libraries
import numpy as np
from matplotlib import pyplot as plt
#Create our uniform distribution
n_draws = 100000 #Number of points to consider in the distribution (should be large)
prior = np.random.uniform(0,1,n_draws)
#Plot the distribution as a histogram
plt.hist(prior)
plt.show()

Uniform distribution

We can now use our prior to generate a binomial distribution for our sample size of 20 people:
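A minimal sketch of this step, assuming NumPy's binomial sampler is used to simulate one survey of 20 people for every point drawn from the prior. The names `gen` and `data` are taken from the filter expression that follows; the seeded generator `rng` and the name `n_people` are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator, an assumption for repeatability
n_draws = 100000                 # number of points drawn from the prior
n_people = 20                    # survey sample size
data = 5                         # observed number of smokers in the survey

# Prior: every smoking proportion between 0 and 1 is equally likely
prior = rng.uniform(0, 1, n_draws)
# For each candidate proportion, simulate how many of the 20 people would smoke
gen = rng.binomial(n_people, prior)
```

Because the prior is uniform, each count from 0 to 20 should appear about equally often in `gen`.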


Uniform binomial distribution

The posterior distribution for our survey can be obtained by keeping only the prior points whose simulated survey yielded our observed result of 5 smokers:

post=[x for x,y in zip(prior,gen) if y==data]

Posterior distribution
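One way to sanity-check the simulated posterior is to summarize it and compare it with the known analytic answer: a uniform prior combined with 5 smokers out of 20 gives a Beta(6, 16) distribution, whose mean is 6/22. The sketch below repeats the simulation with a seeded generator (an assumption of this example) so that it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_people, data = 100000, 20, 5

prior = rng.uniform(0, 1, n_draws)
gen = rng.binomial(n_people, prior)
post = prior[gen == data]    # same selection as the list comprehension in the text

# Summaries of the simulated posterior
post_mean = post.mean()
low, high = np.percentile(post, [2.5, 97.5])   # central 95% interval

# Analytic reference: mean of Beta(data + 1, n_people - data + 1)
exact_mean = (data + 1) / (n_people + 2)
print("simulated mean %.3f, exact mean %.3f" % (post_mean, exact_mean))
print("95%% interval: %.3f to %.3f" % (low, high))
```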

If new data is obtained later, the posterior distribution can be used as the prior hypothesis for the new evidence.
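The sketch below illustrates this kind of sequential update. The second survey (3 smokers out of 10) is made up purely for illustration, and the helper name `update` is an assumption of this example. The posterior sample from the first survey is resampled with replacement so that it can act as the prior for the second one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100000

def update(prior_sample, n_people, observed):
    """Keep the prior draws whose simulated survey matches the observed count."""
    gen = rng.binomial(n_people, prior_sample)
    return prior_sample[gen == observed]

# First survey from the text: 5 smokers out of 20, starting from a uniform prior
post_1 = update(rng.uniform(0, 1, n_draws), 20, 5)

# Hypothetical second survey (made-up numbers): 3 smokers out of 10.
# The old posterior is resampled to serve as the new prior.
new_prior = rng.choice(post_1, size=n_draws)
post_2 = update(new_prior, 10, 3)

print("mean after first survey : %.3f" % post_1.mean())
print("mean after second survey: %.3f" % post_2.mean())
```

For reference, the exact posteriors in this setup are Beta(6, 16) and then Beta(9, 23), so the two means should land near 6/22 and 9/32.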

The main pros and cons of using Bayesian models in machine learning are:

• Positive: Performs well for very small datasets.
• Negative: Bayesian models are computationally expensive and slow.

In machine learning, Bayes’ theorem is primarily used for classification, in models called naive Bayes classifiers. They are called naive because they make a big assumption:

Each feature is independent of every other feature, given the class.

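This assumption means that, given a set of feature values, the probability that they belong to a particular class is proportional to the class prior multiplied by the likelihood of each feature value (the denominator of Bayes’ theorem is dropped because it is the same for every class):

    \[ P(C_k|x_1,x_2,\dots,x_n)\propto P(C_k)\times P(x_1|C_k)\times P(x_2|C_k)\times\dots\times P(x_n|C_k) \]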

• C_k is the class k
• x_n is the value of feature n
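The product on the right-hand side of the equation above can be computed directly. The sketch below does so for a hypothetical model with two classes and two features; every number in it is made up for illustration. Dividing each score by the sum of the scores turns them into posterior probabilities that add up to 1.

```python
import numpy as np

# Hypothetical model with two classes and two features (numbers are made up)
class_priors = {"class_a": 0.4, "class_b": 0.6}   # P(C_k)
feature_likelihoods = {                            # P(x_n|C_k) for the observed feature values
    "class_a": [0.8, 0.3],
    "class_b": [0.1, 0.5],
}

# Score of each class: P(C_k) multiplied by every P(x_n|C_k)
scores = {c: class_priors[c] * float(np.prod(feature_likelihoods[c])) for c in class_priors}
predicted = max(scores, key=scores.get)

# Dividing by the sum of the scores gives normalized posterior probabilities
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
print(predicted, posteriors)
```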

When dealing with continuous data, Gaussian naive Bayes can be used, in which it is assumed that the values of each feature within a class follow a Gaussian distribution. Therefore, the likelihood of a feature value given a class, P(x_n|C_k), can be calculated from the mean and variance of that feature in the training data for that class. It is given by:

    \[ P(x_n|C_k)= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\bigg(\frac{-(x_n-\mu)^2}{2\sigma^2}\bigg) \]

• \sigma^2 is the variance of feature n within class k
• \mu is the mean of feature n within class k
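As a quick illustration of the Gaussian formula, the sketch below evaluates it for a single feature. The function name `gaussian_likelihood` and the sample values are assumptions made for this example only.

```python
import numpy as np

def gaussian_likelihood(x, mean, variance):
    """Gaussian density of x for a feature with the given class mean and variance."""
    return np.exp(-(x - mean) ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)

# Hypothetical values of one feature observed for one class
values = np.array([4.9, 5.1, 5.0, 5.4, 4.6])
mu, var = values.mean(), values.var()

# Likelihood of a new value under that class
p = gaussian_likelihood(5.2, mu, var)
# Check against a known value: at x = mean with variance 1 the density is 1/sqrt(2*pi)
peak = gaussian_likelihood(0.0, 0.0, 1.0)
print(p, peak)
```

Note that these are density values rather than probabilities, so they can exceed 1 when the variance is small. This does not affect the comparison between classes.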

Therefore, we can develop our own Gaussian naive Bayes model:

class GaussianBayes:
    def __init__(self,data,target):
        #perform error checks and store number of features in data
        assert type(data) is np.ndarray,"data must be an array"
        assert data.ndim<=2,"The data should be a 2D or 1D array."
        self.num_features = 1 if data.ndim==1 else data.shape[1]
        num_points = len(data) if data.ndim==1 else data.shape[0]
        assert num_points==len(target), "Number of points in data (%d) is not equal to the number of targets (%d)" % (num_points,len(target))
        #Separate data by class (assumed here: a dict mapping each class label to a list of its data points)
        self.separated = {}
        self.separate(data,target)
        #record the mean and variance of the data for each class
        self.model = {i:[np.mean(self.separated[i],axis=0),np.var(self.separated[i],axis=0)] for i in self.separated}
        #record the prior P(C_k) of each class as its share of the training points, as in the joint probability equation
        self.priors = {i:len(self.separated[i])/num_points for i in self.separated}
    def predict(self,data):
        #perform error checks
        features_in_data = len(data) if data.ndim==1 else data.shape[1]
        assert features_in_data==self.num_features, "Number of features in data (%d) not equal to number of features in training data (%d)" % (features_in_data,self.num_features)
        #create function for Gaussian probability
        Gaus = lambda val,mean,variance: (1/(2*np.pi*variance)**0.5)*np.exp((-(val-mean)**2)/(2*variance))
        class_names = list(self.model.keys())
        # return result if only one data point is entered
        if data.ndim==1: return class_names[np.argmax([self.priors[i]*np.prod(Gaus(np.array(data),np.array(self.model[i][0]),np.array(self.model[i][1]))) for i in class_names])]
        #Create results array and add the joint probability for each class for each data point
        results = np.zeros((data.shape[0],len(class_names)))
        for class_index in range(len(class_names)):
            np_mean = np.array(self.model[class_names[class_index]][0])
            np_var = np.array(self.model[class_names[class_index]][1])
            for k in range(data.shape[0]):
                np_data = np.array(data[k,:])
                #prior of the class multiplied by the Gaussian likelihood of every feature
                results[k,class_index] = self.priors[class_names[class_index]]*np.prod(Gaus(np_data,np_mean,np_var))
        #return classname for the class with the highest joint probability for each data point
        return [class_names[i] for i in np.argmax(results,axis=1)]
    def separate(self,data,target):
        #group the data points by their class label
        for i in range(len(target)):
            if target[i] not in self.separated:
                self.separated[target[i]] = []
            self.separated[target[i]].append(data[i])

If we evaluate the model with the iris dataset in sklearn we get the following:

from sklearn import datasets
iris = datasets.load_iris()
y_pred = GaussianBayes(iris.data,iris.target).predict(iris.data)
print("Number of mislabeled points from our model out of a total %d points : %d"
      % (iris.data.shape[0],(iris.target != y_pred).sum()))
Number of mislabeled points from our model out of a total 150 points : 6

Comparing this to the Gaussian naive Bayes model included in sklearn we get:

from sklearn import datasets
iris = datasets.load_iris()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(iris.data,iris.target).predict(iris.data)
print("Number of mislabeled points from sklearn model out of a total %d points : %d"
      % (iris.data.shape[0],(iris.target != y_pred).sum()))
Number of mislabeled points from sklearn model out of a total 150 points : 6

Therefore, on this dataset our model mislabels the same number of points as the model in sklearn.

For more information on classification algorithms, please see my post:
Comparison of classification algorithms