Naive Bayes (NB) and Logistic Regression (LR) are two simple algorithms for learning a classifier from data. Judging from their dry descriptions alone, they look very different: LR learns a set of weights to be multiplied with the features, while NB counts the occurrences of joint events and later uses those counts as a lookup table for classification. However, they can be seen as two variations of a unified framework: the probabilistic classifier. They mostly differ in their choice of representation.

What are probabilistic classifiers anyway? They are models that, at inference time, return:

\begin{equation} \text{argmax}_y P(y \mid \textbf{x}) \end{equation}

as output, where \(\textbf{x}\) is the feature vector of the input, and \(y\) is a potential class label for the input. For now, let's only consider the binary classification case, where the output of the model is a single number: the probability of the positive class, i.e., \(P(y = 1 \mid \textbf{x})\).
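To make the framework concrete, here is a minimal Python sketch of the argmax rule above. The `proba_fn` interface is my own illustration (not from any particular library): any model that can produce \(P(y \mid \textbf{x})\), whether NB or LR, plugs into the same decision rule.

```python
import numpy as np

def predict(proba_fn, x, labels):
    """Generic probabilistic classifier: return argmax_y P(y | x).

    `proba_fn` is any function mapping a feature vector to a vector of
    class probabilities (one entry per label) -- a hypothetical stand-in
    for NB, LR, or any other probabilistic model.
    """
    probs = proba_fn(x)
    return labels[int(np.argmax(probs))]
```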

NB and LR both follow this framework. Logistic Regression chooses to directly model the conditional distribution \(P(y \mid \textbf{x})\): it learns a set of weights \(\textbf{w}\), computes the dot product of the weights and the features, and applies the sigmoid function \(\sigma(z) = 1 / (1 + e^{-z})\) to get the probability of the positive class:

\begin{align} P(y = 1 \mid \textbf{x}) &= \sigma(\textbf{w}^T \textbf{x}) \end{align}
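In code, LR inference is just a dot product followed by a sigmoid. A minimal NumPy sketch (the weight and feature values here are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_positive_proba(w, x):
    """P(y = 1 | x) for logistic regression: sigmoid of the dot product."""
    return sigmoid(np.dot(w, x))

# Example: with learned weights w, classify x as positive if P > 0.5.
w = np.array([0.8, -1.2, 0.3])
x = np.array([1.0, 0.0, 1.0])
print(lr_positive_proba(w, x))  # sigmoid(1.1) ~= 0.75
```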

Meanwhile, Naive Bayes, as a generative model, models the joint distribution of the data, \(P(\textbf{x}, y)\). To classify, it rewrites \(P(y \mid \textbf{x})\) in terms of the quantities it knows:

\begin{align} P(y \mid \textbf{x}) &= \frac{P(\textbf{x}, y)}{P(\textbf{x})} \\
&= \frac{P(\textbf{x} \mid y)\, P(y)}{P(\textbf{x})} \\
&\propto P(y)\, P(\textbf{x} \mid y) \\
&= a_y \prod_{i=1}^{n} b_{y,i}^{x_i} \end{align}

where \(a_y\) is the (prior) probability of class \(y\), and \(b_{y,i}\) is the probability that feature \(i\) equals 1 given class \(y\). The last equality is where the "naive" part comes in: the features are assumed to be conditionally independent given the class, so \(P(\textbf{x} \mid y)\) factors into a product over individual features. All of these variables are learned from data, just like \(\textbf{w}\) in LR.
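The counting view translates directly into a few lines of NumPy. Below is a minimal sketch of estimating \(a_y\) and \(b_{y,i}\) from a binary feature matrix and scoring with the product above (in log space for numerical safety). The Laplace smoothing term `alpha` is an extra assumption I'm adding to avoid zero probabilities; it is not part of the derivation.

```python
import numpy as np

def train_nb(X, y, alpha=1.0):
    """Estimate NB parameters by counting, with Laplace smoothing `alpha`.

    X: (m, n) binary feature matrix; y: (m,) labels in {0, 1}.
    Returns a (class priors a_y) and b (b_{y,i} = P(feature i = 1 | y)).
    """
    a = np.array([np.mean(y == c) for c in (0, 1)])
    b = np.array([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in (0, 1)
    ])
    return a, b

def nb_predict(a, b, x):
    """argmax_y  a_y * prod_i b_{y,i}^{x_i}, computed as log a_y + sum_i x_i log b_{y,i}."""
    scores = np.log(a) + x @ np.log(b).T
    return int(np.argmax(scores))
```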

Bonus: this cool Ng and Jordan (2002) paper, "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes," compares the behaviors of the two models with respect to the correctness of their assumptions, the number of parameters, and classification performance as the training set grows.