Bayes’ theorem has been described as being to the theory of probability what the Pythagorean theorem is to geometry [1]. Simply put, Bayes’ theorem describes the probability of a hypothesis based on evidence that may be related to the hypothesis. For example, suppose you want to determine the probability that a person has cancer given that he smokes. Bayes’ theorem gives this probability from the probability that someone smokes given that he has cancer, together with the probability of getting cancer and the probability that someone smokes (whatever the cause). Mathematically, Bayes’ theorem is expressed as:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

where:

• $P(H \mid E)$ is the probability that our hypothesis is true given the evidence (*Posterior*).

• $P(E \mid H)$ is the probability of the evidence given that our hypothesis is true (*Likelihood*).

• $P(H)$ is the probability of our hypothesis before the evidence (*Prior*).

• $P(E)$ is the probability of the evidence (*Marginal*).

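Therefore, our example could be expressed as:

$$P(\text{cancer} \mid \text{smoker}) = \frac{P(\text{smoker} \mid \text{cancer})\,P(\text{cancer})}{P(\text{smoker})}$$

where:

• $P(\text{cancer} \mid \text{smoker})$ is the probability that a person has cancer given that he smokes.

• $P(\text{smoker} \mid \text{cancer})$ is the probability that someone smokes given that he has cancer.

• $P(\text{cancer})$ is the probability of someone having cancer.

• $P(\text{smoker})$ is the probability that someone smokes.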
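
To make the roles of the four terms concrete, the short sketch below plugs numbers into Bayes’ theorem for the smoking example. The three input values are hypothetical, chosen only for illustration, and are not real medical statistics.

```python
# Hypothetical inputs (illustrative only, not real statistics)
p_smoker_given_cancer = 0.25  # Likelihood: P(smoker | cancer)
p_cancer = 0.05               # Prior: P(cancer)
p_smoker = 0.10               # Marginal: P(smoker)

# Bayes' theorem: Posterior = Likelihood * Prior / Marginal
p_cancer_given_smoker = p_smoker_given_cancer * p_cancer / p_smoker
print("P(cancer | smoker) = %.3f" % p_cancer_given_smoker)
```

With these made-up numbers the posterior (0.125) is larger than the prior (0.05), because smoking is assumed to be more common among people with cancer than in the population as a whole.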

In practice, often only the **numerator** of the equation matters: the denominator does not depend on the hypothesis, so once the evidence is given it is effectively a constant.
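
One way to see this is to compare several competing hypotheses for the same evidence. Writing $H_1, \dots, H_m$ for the hypotheses and $E$ for the evidence (notation assumed here for illustration, with the hypotheses taken to be mutually exclusive and exhaustive), the denominator is the same sum whichever hypothesis is being scored:

$$P(E) = \sum_{j=1}^{m} P(E \mid H_j)\,P(H_j)$$

so the most probable hypothesis can be found from the numerators alone:

$$\arg\max_{i} P(H_i \mid E) = \arg\max_{i} \frac{P(E \mid H_i)\,P(H_i)}{P(E)} = \arg\max_{i} P(E \mid H_i)\,P(H_i)$$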

Bayes’ theorem can be extended to probability distributions (to determine confidence intervals and other statistical tests). For example, a survey of 20 people with cancer found that 5 of them smoked. To be safe, we create our *Prior* hypothesis without assuming anything about how common smoking is among people with cancer (i.e. every proportion of smokers between 0 and 1 is considered equally likely). This can be represented as a uniform random distribution:

```python
#%% Python Learn Function
# Import libraries
import numpy as np
from matplotlib import pyplot as plt

# Create our uniform distribution
n_draws = 100000  # Number of points to consider in the distribution (should be large)
prior = np.random.uniform(0, 1, n_draws)

# Plot the distribution as a histogram
plt.hist(prior, 21)
```

We can now use our prior to generate a binomial distribution for our sample size of 20 people:

```python
# Simulate one survey of 20 people for each value drawn from the prior
gen = np.random.binomial(20, prior)
plt.hist(gen, 21)
```

The posterior distribution for our survey can be obtained by selecting the generated points that yielded our result of 5 people:

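```python
data = 5  # Number of smokers observed in the survey

# Keep only the prior values whose simulated survey matched the observed result
post = [x for x, y in zip(prior, gen) if y == data]
plt.hist(post, 21)
```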
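
The three steps above can also be collected into one self-contained sketch that summarises the posterior numerically instead of plotting it. The seeded `np.random.default_rng` generator, the variable name `n_people`, and the choice of a 95% interval are assumptions made here for illustration, not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (the seed is arbitrary)

n_draws = 100000  # number of prior samples
n_people = 20     # survey size
data = 5          # observed number of smokers

# Same steps as above: uniform prior, simulated surveys, keep the matches
prior = rng.uniform(0, 1, n_draws)
gen = rng.binomial(n_people, prior)
post = prior[gen == data]

# Summarise the posterior samples
post_mean = post.mean()
low, high = np.percentile(post, [2.5, 97.5])
print("Posterior mean: %.3f" % post_mean)
print("95%% interval: %.3f to %.3f" % (low, high))
```

As a sanity check, a uniform prior combined with 5 successes in 20 trials has the exact posterior Beta(6, 16), whose mean is 6/22 ≈ 0.273, so the sampled mean should land close to that value.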

If new data is obtained later, the posterior distribution can be used as the prior hypothesis for the new evidence.
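
A minimal sketch of such an update, assuming a hypothetical second survey in which 3 of 10 people with cancer smoked (these numbers are invented for illustration). The first posterior is resampled so that it can play the role of the prior for the new evidence.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility

n_draws = 200000

# First survey (from the text): 5 smokers out of 20, starting from a uniform prior
prior = rng.uniform(0, 1, n_draws)
post = prior[rng.binomial(20, prior) == 5]

# Hypothetical second survey: 3 smokers out of 10.
# Resample the first posterior so that it acts as the new prior.
new_prior = rng.choice(post, size=n_draws, replace=True)
new_post = new_prior[rng.binomial(10, new_prior) == 3]

print("Mean after first survey:  %.3f" % post.mean())
print("Mean after second survey: %.3f" % new_post.mean())
```

Updating in two steps like this should give approximately the same posterior as analysing all 30 people at once (8 smokers out of 30), whose exact posterior mean under a uniform prior is 9/32 ≈ 0.281.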

The main pros and cons of using Bayesian models in machine learning are:

| Positive | Negative |
| --- | --- |
| Performs well for very small datasets. | Bayesian models are computationally expensive and slow. |

In machine learning, Bayes’ theorem is primarily used for classification, through models called **Naive Bayes classifiers**. They are called naive because they make a big assumption:

**Each feature is independent of every other feature, given the class.**

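This assumption means that, given a set of features, we can calculate the probability that they belong to a particular class (up to the constant denominator) using the equation:

$$P(C_k \mid x_1, \dots, x_n) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$

where

• $C_k$ is the class k

• $x_n$ is the value of feature n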
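
As a rough illustration of this calculation, the sketch below scores one observation with three features against two classes. The class names, priors and per-feature likelihoods are all hypothetical values made up for the example.

```python
import numpy as np

# Hypothetical class priors P(C_k)
priors = {"class_a": 0.4, "class_b": 0.6}

# Hypothetical likelihoods P(x_i | C_k) of the three observed feature values
likelihoods = {
    "class_a": [0.8, 0.3, 0.5],
    "class_b": [0.1, 0.6, 0.5],
}

# Unnormalised score for each class: P(C_k) * product over i of P(x_i | C_k)
scores = {k: priors[k] * np.prod(likelihoods[k]) for k in priors}

# Dividing by the sum of the scores turns them into posterior probabilities
total = sum(scores.values())
posteriors = {k: s / total for k, s in scores.items()}

# The predicted class is the one with the highest score
best = max(scores, key=scores.get)
print(best, posteriors)
```

The normalisation step changes the numbers but not their order, so the predicted class can be read from the unnormalised scores directly.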

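When dealing with continuous data, **Gaussian naive Bayes** can be used, in which it is assumed that the values of each feature within a class are distributed according to a Gaussian distribution. Therefore, the probability of each feature value belonging to a class ($P(x_i \mid C_k)$) can be calculated using the mean and variance of the feature data for that class. This equation is given by:

$$P(x_i \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left(-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}\right)$$

where

• $\sigma_k^2$ is the variance of the feature's values in class k

• $\mu_k$ is the mean of the feature's values in class k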
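
The Gaussian density can be written as a small helper and tried on a handful of numbers. The function name `gaussian_pdf` and the sample feature values below are assumptions made for illustration.

```python
import numpy as np


def gaussian_pdf(x, mean, variance):
    """Gaussian probability density of x for a given mean and variance."""
    return np.exp(-((x - mean) ** 2) / (2 * variance)) / np.sqrt(2 * np.pi * variance)


# Hypothetical values of one feature observed for one class
feature_values = np.array([4.9, 5.1, 5.0, 5.4, 4.6])
mean = feature_values.mean()
variance = feature_values.var()

# Density of a new value close to the class mean
near = gaussian_pdf(5.0, mean, variance)
# Density of a new value far from the class mean (much smaller)
far = gaussian_pdf(7.0, mean, variance)
print(near, far)
```

Note that these are probability densities rather than probabilities, so values above 1 are possible when the variance is small; only their relative sizes across classes matter for classification.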

Therefore, we can develop our own Gaussian naive Bayes model:

```python
import numpy as np


class GaussianBayes:
    def __init__(self, data, target):
        # Perform error checks and store the number of features in data
        assert type(data) is np.ndarray, "data must be an array"
        assert data.ndim <= 2, "The data should be a 2D or 1D array."
        self.num_features = 1 if data.ndim == 1 else data.shape[1]
        num_points = len(data) if data.ndim == 1 else data.shape[0]
        assert num_points == len(target), (
            "Number of points in data (%d) is not equal to the number of targets (%d)"
            % (num_points, len(target))
        )

        # Separate data by class
        self.separated = {}
        self.separate(data, target)

        # Record the mean and variance of the data for each class
        self.model = {
            i: [np.mean(self.separated[i], axis=0), np.var(self.separated[i], axis=0)]
            for i in self.separated
        }

    def predict(self, data):
        # Perform error checks
        features_in_data = len(data) if data.ndim == 1 else data.shape[1]
        assert features_in_data == self.num_features, (
            "Number of features in data (%d) not equal to number of features in training data (%d)"
            % (features_in_data, self.num_features)
        )

        # Create function for Gaussian probability
        Gaus = lambda val, mean, variance: (1 / (2 * np.pi * variance) ** 0.5) * np.exp(
            (-(val - mean) ** 2) / (2 * variance)
        )
        class_names = list(self.model.keys())

        # Note: the class prior is not included in the scores below,
        # so all classes are treated as equally likely.

        # Return result if only one data point is entered
        if data.ndim == 1:
            scores = [
                np.prod(Gaus(np.array(data), np.array(self.model[i][0]), np.array(self.model[i][1])))
                for i in class_names
            ]
            return class_names[np.argmax(scores)]

        # Create results array and add the joint probability for each class for each data point
        results = np.zeros((data.shape[0], len(class_names)))
        for class_index in range(len(class_names)):
            np_mean = np.array(self.model[class_names[class_index]][0])
            np_var = np.array(self.model[class_names[class_index]][1])
            for k in range(data.shape[0]):
                np_data = np.array(data[k, :])
                joint_probability = np.prod(Gaus(np_data, np_mean, np_var))
                results[k, class_index] = joint_probability

        # Return the class name with the highest joint probability for each data point
        return [class_names[i] for i in np.argmax(results, axis=1)]

    def separate(self, data, target):
        # Group the data points by their class label
        for i in range(len(target)):
            if target[i] not in self.separated:
                self.separated[target[i]] = []
            self.separated[target[i]].append(data[i])
```
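
One design point worth noting: the model above multiplies the per-feature densities with `np.prod`. With only a few features this works, but with many features the product can underflow to zero. A common alternative is to add log densities instead. The sketch below shows the effect on a hypothetical point with 1000 features and two hypothetical classes. The helper name `gaussian_log_pdf` and all numbers are assumptions for illustration.

```python
import numpy as np


def gaussian_log_pdf(x, mean, variance):
    """Log of the Gaussian density, evaluated element-wise."""
    return -0.5 * np.log(2 * np.pi * variance) - (x - mean) ** 2 / (2 * variance)


x = np.full(1000, 3.0)        # hypothetical point with 1000 identical feature values
means = {"A": 0.0, "B": 1.0}  # hypothetical class means (variance 1 for both classes)

# Multiplying the densities directly underflows for both classes
products = {k: np.prod(np.exp(gaussian_log_pdf(x, m, 1.0))) for k, m in means.items()}

# Summing the log densities keeps the scores finite and comparable
log_sums = {k: np.sum(gaussian_log_pdf(x, m, 1.0)) for k, m in means.items()}

print(products)  # both products underflow to 0.0, so the classes cannot be ranked
print(log_sums)
print(max(log_sums, key=log_sums.get))
```

Because the logarithm is monotonic, the class with the largest sum of log densities is the same class that would have the largest product if it could be computed exactly.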

If we evaluate the model with the iris dataset in sklearn we get the following:

```python
from sklearn import datasets

iris = datasets.load_iris()
y_pred = GaussianBayes(iris.data, iris.target).predict(iris.data)
print("Number of mislabeled points from our model out of a total %d points : %d"
      % (iris.data.shape[0], (iris.target != y_pred).sum()))
```

Number of mislabeled points from our model out of a total 150 points : 6

Comparing this to the Gaussian naive Bayes model included in sklearn we get:

```python
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print("Number of mislabeled points from sklearn model out of a total %d points : %d"
      % (iris.data.shape[0], (iris.target != y_pred).sum()))
```

Number of mislabeled points from sklearn model out of a total 150 points : 6

Therefore, our model mislabels the same number of points as the model in sklearn.

For more information on classification algorithms, please see my post:

Comparison of classification algorithms

**References**

**1.** https://en.wikipedia.org/wiki/Bayes%27_theorem