Expectation Maximization with Bayesian Networks
A learning note about an algorithm that puts two complicated things together
There is some fancy jargon that sounds very attractive to me – “Bayesian”, “graphical model”, “optimization”, “probabilistic”, etc. I like to learn these things because they are spoken by people smarter than me. It turns out my course covers them. Today I will talk about Bayesian networks, EM, and their combination. (Feeling so fancy!)
Bayesian networks
A Bayesian network, or “Bayes net”, is a type of probabilistic model that is as expressive as a full joint probability table, but usually much more compact, and thus much more tractable to work with. As such, it is used to model the probabilities of events. This simple-sounding job is actually very generic: you can use such a probabilistic model to generate data (a generative model, as in generative AI), or to do classification (using the probabilities).
Representation
Its representation contains two components: a graph and a set of tables, one per node.
The graph has the random variables as nodes, and their direct probabilistic causal relationships as directed edges. It must be a DAG, so we can define parent-child relationships; we denote by \(pa(u)\) the set of parent nodes of a node \(u\). More importantly, the graph encodes conditional independences between variables, which is the key to its compactness. Kevin Murphy states in his textbook that Bayes nets should have been called “independence diagrams” to be more precise.
Next, each node \(u\) in the graph is accompanied by a table, called the Conditional Probability Table (CPT). It stores the parameters for the probability distribution of the corresponding random variable, conditioned on each assignment of all variables in \(pa(u)\). For example, consider a node \(A\) with \(pa(A)=\{B, C\}\):
- If all variables are binary, then the CPT of \(A\) would encode the Bernoulli distribution of \(A\) for each of the four assignments of \((B, C)\). For a Bernoulli distribution you only need one number, the probability of the positive event. So the table would look like below, with ? replaced by actual probabilities (a code sketch after this list makes this case concrete):
B | C | P(A=1 \| B, C) |
---|---|---|
0 | 0 | ? |
0 | 1 | ? |
1 | 0 | ? |
1 | 1 | ? |
- If all variables are discrete with domain \(\{1, 2, ..., K\}\), then the CPT of \(A\) would have \(K^2\) rows, one for each assignment of \((B, C)\), and \(K-1\) columns (in addition to the first two for \(B\) and \(C\)), storing the probabilities \(P(A=i \mid b, c)\) for \(1 \leq i \leq K-1\) (the last one is implied, since they sum to 1). In other words, we are modeling a categorical distribution of \(A\) for each parent assignment.
- If all variables are continuous, \(A\) must be modeled by some continuous distribution such as the Gaussian. In that case, the \(\mu\) and \(\sigma\) of the Gaussian will be functions of \(B\) and \(C\).
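To make this concrete for myself, here is a minimal Python sketch (my own toy code, not from the course) of how the binary example above could be stored: the graph as a dictionary of parents, and each CPT as a dictionary keyed by parent assignments. All the numbers are made-up placeholders.

```python
# A tiny, hypothetical Bayes net over binary variables with edges B -> A <- C.
# The graph is just a "parents" dictionary; each CPT maps an assignment of the
# parents to P(node = 1 | parents). All probabilities are invented placeholders.
parents = {
    "B": (),
    "C": (),
    "A": ("B", "C"),
}

cpts = {
    # Root nodes have a single row: the empty parent assignment.
    "B": {(): 0.6},            # P(B=1)
    "C": {(): 0.3},            # P(C=1)
    # A's CPT has one row per assignment of (B, C), exactly like the table above.
    "A": {
        (0, 0): 0.1,           # P(A=1 | B=0, C=0)
        (0, 1): 0.7,           # P(A=1 | B=0, C=1)
        (1, 0): 0.4,           # P(A=1 | B=1, C=0)
        (1, 1): 0.9,           # P(A=1 | B=1, C=1)
    },
}

def prob(node, value, assignment):
    """P(node = value | its parents), read off the CPT for a full assignment dict."""
    row = tuple(assignment[p] for p in parents[node])
    p1 = cpts[node][row]
    return p1 if value == 1 else 1.0 - p1
```

For the categorical case each row would hold \(K-1\) numbers instead of one, and for the Gaussian case it would hold (or compute) a mean and a standard deviation from the parent values.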
Finally, this class of models is also called graphical models and belief networks, both of which Murphy also considers bad names.
So a Bayes net is a generalization of Naive Bayes models and a compact equivalent of the full joint probability table: the joint distribution factorizes as \(P(X_1, \dots, X_n) = \prod_i P(X_i \mid pa(X_i))\), so multiplying out the CPTs recovers the full table.
Inference
How do we do inference with a Bayes net? Recall that inference in probabilistic models is the task of returning the probability of a (possibly joint, possibly conditional) event, such as \(P(A)\), \(P(A\mid B)\), or \(P(A,C\mid B)\). In principle, because a Bayes net is as expressive as a full joint probability table, we can just convert the Bayes net to that table and do inference using marginalization and the definition of conditional probability. However, that would defeat the whole purpose of compressing things into this neat network.
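To see what “convert to the table and marginalize” means in code, here is a brute-force sketch on the same toy network (definitions repeated so the snippet runs on its own). It enumerates every assignment of the binary variables, so it is exponential in the number of variables, which is exactly the cost that variable elimination tries to avoid.

```python
from itertools import product

# Same toy network as in the Representation sketch: B -> A <- C, all binary.
parents = {"B": (), "C": (), "A": ("B", "C")}
cpts = {"B": {(): 0.6}, "C": {(): 0.3},
        "A": {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.4, (1, 1): 0.9}}

def prob(node, value, assignment):
    """P(node = value | its parents), read off the CPT."""
    p1 = cpts[node][tuple(assignment[p] for p in parents[node])]
    return p1 if value == 1 else 1.0 - p1

variables = list(parents)

def joint(assignment):
    """P(full assignment) = product over nodes of P(node | its parents)."""
    p = 1.0
    for v in variables:
        p *= prob(v, assignment[v], assignment)
    return p

def query(target, value, evidence):
    """Brute-force P(target = value | evidence) by summing over all assignments."""
    num = den = 0.0
    for vals in product([0, 1], repeat=len(variables)):
        a = dict(zip(variables, vals))
        if any(a[e] != ev for e, ev in evidence.items()):
            continue
        p = joint(a)
        den += p
        if a[target] == value:
            num += p
    return num / den

print(query("A", 1, {"C": 1}))   # P(A=1 | C=1) on the toy network
```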
Here comes variable elimination – an exact inference algorithm for Bayes nets. It can be exponential in time and space if some intermediate factor ends up involving too many variables, but on typical networks it works well. Refer to this lecture from IIT Delhi (2018) for an excellent explanation of the algorithm in under 30 minutes.1
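Since I did not find a compact reference I liked (see the footnote), here is my own rough sketch of the two core factor operations behind variable elimination: pointwise product and summing a variable out, again assuming binary variables and the toy network above. It is a learning sketch only, with no evidence handling and no elimination-order heuristics.

```python
from itertools import product

# Same toy network again: B -> A <- C, all binary, made-up CPT numbers.
parents = {"B": (), "C": (), "A": ("B", "C")}
cpts = {"B": {(): 0.6}, "C": {(): 0.3},
        "A": {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.4, (1, 1): 0.9}}

def prob(node, value, assignment):
    p1 = cpts[node][tuple(assignment[p] for p in parents[node])]
    return p1 if value == 1 else 1.0 - p1

# A factor is a pair (vars, table): a tuple of variable names and a dict
# mapping each tuple of their values to a number.

def cpt_to_factor(node):
    """Turn a node's CPT into a factor over (node,) + its parents."""
    fvars = (node,) + parents[node]
    return fvars, {vals: prob(node, vals[0], dict(zip(fvars, vals)))
                   for vals in product([0, 1], repeat=len(fvars))}

def multiply(f, g):
    """Pointwise product of two factors, joining on shared variables."""
    fvars, ftab = f
    gvars, gtab = g
    out_vars = fvars + tuple(v for v in gvars if v not in fvars)
    table = {}
    for vals in product([0, 1], repeat=len(out_vars)):
        a = dict(zip(out_vars, vals))
        table[vals] = (ftab[tuple(a[v] for v in fvars)]
                       * gtab[tuple(a[v] for v in gvars)])
    return out_vars, table

def sum_out(var, f):
    """Marginalize one variable out of a factor."""
    fvars, ftab = f
    keep = tuple(v for v in fvars if v != var)
    table = {}
    for vals, p in ftab.items():
        key = tuple(v for name, v in zip(fvars, vals) if name != var)
        table[key] = table.get(key, 0.0) + p
    return keep, table

# Eliminate C: multiply the factors that mention C, then sum C out.
f = sum_out("C", multiply(cpt_to_factor("A"), cpt_to_factor("C")))
print(multiply(f, cpt_to_factor("B")))   # a factor over (A, B): the joint P(A, B)
```

The full algorithm just repeats this step: for each hidden variable, multiply together all the factors that mention it, sum it out, and put the result back into the pool of factors.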
Expectation Maximization
Now, EM is an algorithm to fill in missing data in a dataset in a way that (1) is grounded in assumptions about the world encoded by a probabilistic generative model and (2) updates the internal parameters of that model as well. I won’t define it formally here because it should be easy to read up on (unlike VE).
Note that EM is not the only algorithm that does those two things. In fact, it is an iterative algorithm that starts from a random initialization of the parameters, so it is sensitive to that initialization. The cleanest approach would be an analytical solution that finds both the parameters of the model and the missing values, but that generally does not exist.
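Even though I won’t define EM formally, a tiny self-contained sketch of what the loop looks like might help: the classic two-coin mixture, where each trial’s head count is observed but the coin that produced it is hidden. The data and the starting guesses below are invented; rerunning with different initial values of theta shows the sensitivity to initialization.

```python
from math import comb

# Classic toy problem: each trial is `flips` tosses of ONE of two coins, but we
# never see which coin was used; we only see the number of heads per trial.
flips = 10
heads = [5, 9, 8, 4, 7]          # made-up observations
theta = [0.3, 0.6]               # initial guesses for the two coins' head probabilities

def likelihood(h, t):
    """Binomial likelihood of h heads in `flips` tosses with head-probability t."""
    return comb(flips, h) * t**h * (1 - t)**(flips - h)

for _ in range(50):
    # E-step: posterior responsibility of each coin for each trial
    # (assuming, for simplicity, a fixed 50/50 chance of picking either coin).
    resp = []
    for h in heads:
        l0, l1 = likelihood(h, theta[0]), likelihood(h, theta[1])
        resp.append((l0 / (l0 + l1), l1 / (l0 + l1)))
    # M-step: re-estimate each coin's bias from its expected share of the heads.
    for k in range(2):
        exp_heads = sum(r[k] * h for r, h in zip(resp, heads))
        exp_flips = sum(r[k] * flips for r in resp)
        theta[k] = exp_heads / exp_flips

print(theta)   # try different initial values of `theta` to see the sensitivity
```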
The combination!
The cool thing is that EM and Bayes nets can marry each other! Starting with an initialized (or even pre-trained) Bayes net and a dataset with missing values, we can use EM to both (1) fine-tune the Bayes net’s CPTs against the dataset, and (2) probabilistically fill in the missing values in the dataset! What a beautiful procedure!
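To convince myself this actually works, here is a rough sketch on the smallest possible Bayes net: two binary nodes with an edge \(A \to B\), where some rows of the dataset are missing \(A\). The E-step uses inference on the net (here just Bayes’ rule) to compute a posterior over each missing value, and the M-step re-estimates the CPTs from the expected counts. The data and starting parameters are invented for illustration.

```python
# Tiny Bayes net A -> B over binary variables, with made-up parameters:
#   p_a            = P(A = 1)
#   p_b_given_a[a] = P(B = 1 | A = a)
p_a = 0.5
p_b_given_a = [0.5, 0.5]

# Dataset of (a, b) rows; None marks a missing value of A.
data = [(1, 1), (1, 1), (0, 0), (None, 1), (None, 0), (0, 1), (None, 1)]

for _ in range(100):
    # E-step: accumulate expected counts, filling each missing A with its
    # posterior P(A = 1 | B = b) under the current parameters.
    exp_a = [0.0, 0.0]      # expected count of A = a
    exp_ab1 = [0.0, 0.0]    # expected count of (A = a, B = 1)
    for a, b in data:
        if a is None:
            w1 = p_a * (p_b_given_a[1] if b == 1 else 1 - p_b_given_a[1])
            w0 = (1 - p_a) * (p_b_given_a[0] if b == 1 else 1 - p_b_given_a[0])
            post1 = w1 / (w0 + w1)   # Bayes' rule on the little network
        else:
            post1 = float(a)         # observed value: no uncertainty
        for val, w in ((1, post1), (0, 1.0 - post1)):
            exp_a[val] += w
            if b == 1:
                exp_ab1[val] += w
    # M-step: re-estimate the CPTs from the expected counts.
    p_a = exp_a[1] / len(data)
    p_b_given_a = [exp_ab1[v] / exp_a[v] for v in (0, 1)]

print(p_a, p_b_given_a)
# The E-step posteriors in the final iteration are the probabilistic
# "fill-ins" for the missing entries of the dataset.
```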
Here I think the beauty lies in the fact that both pieces are quite complicated to me, yet their interfaces match well enough that they can work with each other and combine their strengths. I think that’s the beauty of data structures and algorithm design, and of theoretical computer science.
---
VE was taught in my ML class while I was out of town. I was hoping to catch up using the class materials, but (1) the lecture slides do not contain enough information, and (2) Murphy’s textbook takes a highly convoluted path to define VE. It uses some “v-slashed” notation, which comes from undirected graphical models and concerns cliques and some kind of “potentials”. It is definitely not what I need. ↩