Chapter 19 Motivation for Latent Variable Models

19.0.0.1 A Model for Birth Weights

Recall our model for birth weights, \(Y_1,\ldots, Y_N\). We posited that the birth weights are iid normally distributed with unknown mean \(\mu\) and known variance \(\sigma^2 = 1\), i.e. \(Y_n \sim \mathcal{N}(\mu, 1)\).

Compare the maximum likelihood model and the Bayesian model for birth weight. Which model would you use to make clinical decisions? What’s hard about this comparison?

19.0.0.2 A Similarity Measure for Distributions: Kullback–Leibler Divergence

Visually comparing models to the empirical distribution of the data is impractical. Fortunately, there are a large number of quantitative measures for comparing two distributions; these are called divergence measures. For example, the Kullback–Leibler (KL) divergence is defined for two distributions \(p(\theta)\) and \(q(\theta)\) supported on \(\Theta\) as:

\[ D_{\text{KL}}[q \,\|\, p] = \int_{\Theta} \log\left[\frac{q(\theta)}{p(\theta)} \right] q(\theta)d\theta \]

The KL-divergence \(D_{\text{KL}}[q \,\|\, p]\) is bounded below by 0, and it equals 0 if and only if \(q=p\). The KL-divergence has information theoretic interpretations that we will explore later in the course.

Note: The KL-divergence is defined in terms of the pdfs of \(p\) and \(q\). If \(p\) is a distribution from which we only have samples and not the pdf (like the empirical distribution), we can nonetheless estimate \(D_{\text{KL}}[q \,\|\, p]\). Techniques that estimate the KL-divergence from samples are called non-parametric. We will use them later in the course.
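To make this concrete, here is a minimal numerical sketch (the function name and the integration limits are our own choices) that evaluates \(D_{\text{KL}}\) between two Gaussian pdfs by quadrature:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def kl_divergence(q_pdf, p_pdf, lo=-25.0, hi=25.0):
    # D_KL[q || p] = integral of q(t) log(q(t)/p(t)); the integrand is taken
    # to be 0 wherever q(t) = 0 (the usual convention 0 log 0 = 0)
    def integrand(t):
        qt, pt = q_pdf(t), p_pdf(t)
        return qt * np.log(qt / pt) if qt > 0 else 0.0
    val, _ = integrate.quad(integrand, lo, hi)
    return val

q = lambda t: norm.pdf(t, loc=0.0, scale=1.0)
p = lambda t: norm.pdf(t, loc=1.0, scale=2.0)

print(kl_divergence(q, p))  # ~0.443: positive (matches the Gaussian closed form)
print(kl_divergence(p, q))  # ~1.307: also positive, but different -- KL is asymmetric
print(kl_divergence(q, q))  # ~0.0: the lower bound, attained iff q = p
```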

(OPTIONAL!!!) Why is the KL bounded below by 0?

First let’s see why the answer isn’t obvious. Recall that the KL divergence is the expected log ratio between two distributions:

\[ D_{\text{KL}} [q\| p] = \mathbb{E}_{q}\left[ \log \frac{q}{p}\right] \]

Now, we know that when \(q\) is less than \(p\) (i.e. \(q/p < 1\)), the log can be an arbitrarily negative number. So it’s not immediately obvious that the expected value of this log ratio should always be non-negative!

An Intuitive Explanation

Let the blue curve be \(q\) and the red be \(p\). We have \(q < p\) on \((-\infty, 55)\); on this part of the domain, \(\log(q/p)\) is negative. On \([55, \infty)\), \(\log(q/p)\) is nonnegative.

However, since we are sampling from \(q\), and \(q\)’s mass lies largely over \([55, \infty)\), the log ratio \(\log(q/p)\) will tend to be nonnegative.

A Formal Argument

There are many proofs of the non-negativity of the KL, ranging from the very complex to the very simple. Here is one that involves just a bit of algebra:

We want to show that \(D_{\text{KL}}[q\|p] \geq 0\). Instead we’ll show, equivalently, that \(-D_{\text{KL}}[q\|p] \leq 0\) (we’re choosing to show the statement about the negative KL just so we can flip the fraction inside the log and cancel terms):
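Flipping the fraction and applying the elementary bound \(\log x \leq x - 1\) (valid for all \(x > 0\)) completes the argument:

\[\begin{aligned} -D_{\text{KL}}[q\|p] &= \int_{\Theta} \log\left[\frac{p(\theta)}{q(\theta)}\right] q(\theta)\, d\theta\\ &\leq \int_{\Theta} \left(\frac{p(\theta)}{q(\theta)} - 1\right) q(\theta)\, d\theta \quad (\log x \leq x - 1)\\ &= \int_{\Theta} p(\theta)\, d\theta - \int_{\Theta} q(\theta)\, d\theta = 1 - 1 = 0. \end{aligned}\]

Since \(\log x = x - 1\) only at \(x = 1\), equality holds if and only if \(p(\theta)/q(\theta) = 1\) wherever \(q\) has mass, i.e. \(q = p\).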

19.1 Latent Variable Models

Models that include an observed variable \(Y\) and at least one unobserved variable \(Z\) are called latent variable models. In general, our model can allow \(Y\) and \(Z\) to interact in many different ways. Today, we will study models with one type of interaction:

19.1.1 Example: Gaussian Mixture Models (GMMs)

In a Gaussian Mixture Model (GMM), we posit that the observed data \(Y\) is generated by a mixture, with weights \(\pi=[\pi_1, \ldots, \pi_K]\), of \(K\) Gaussians with means \(\mu = [\mu_1, \ldots, \mu_K]\) and covariances \(\Sigma = [\Sigma_1, \ldots, \Sigma_K]\). For each observation \(Y_n\), the class of the observation \(Z_n\) is a latent variable that indicates which of the \(K\) Gaussians is responsible for generating \(Y_n\):

\[\begin{aligned} Z_n &\sim Cat(\pi),\\ Y_n | Z_n&\sim \mathcal{N}(\mu_{Z_n}, \Sigma_{Z_n}), \end{aligned}\]

where \(n=1, \ldots, N\) and \(\sum_{k=1}^K \pi_k = 1\).

GMMs are examples of model-based clustering: breaking up a data set into natural clusters based on a statistical model fitted to the data.
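To make the generative process concrete, here is a minimal sampling sketch (the particular weights, means, and covariances are illustrative choices, not part of the model):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])              # mixture weights (sum to 1)
mu = np.array([[-2.0], [3.0]])         # component means (here D = 1)
Sigma = np.array([[[1.0]], [[0.25]]])  # component covariances

N = 1000
Z = rng.choice(len(pi), size=N, p=pi)  # Z_n ~ Cat(pi)
Y = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in Z])
# Y[n] was drawn from the Gaussian indexed by the latent class Z[n]
```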

19.1.2 Item-Response Models

In item-response models, we measure a real-valued unobserved trait \(Z\) of a subject by performing a series of experiments with binary observable outcomes, \(Y\):

\[\begin{aligned} Z_n &\sim \mathcal{N}(\mu, \sigma^2),\\ \theta_n &= g(Z_n)\\ Y_n|Z_n &\sim Ber(\theta_n), \end{aligned}\]

where \(n=1, \ldots, N\) and \(g\) is some fixed function mapping \(Z_n\) to a probability in \([0, 1]\).
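As a concrete sketch of the generative process, taking \(g\) to be the sigmoid (one common choice; the model itself leaves \(g\) unspecified):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # maps the real-valued trait to a probability in [0, 1]
    return 1.0 / (1.0 + np.exp(-z))

N, mu, sigma = 1000, 0.0, 1.0
Z = rng.normal(mu, sigma, size=N)   # latent trait:  Z_n ~ N(mu, sigma^2)
theta = sigmoid(Z)                  # success probability:  theta_n = g(Z_n)
Y = rng.binomial(1, theta)          # binary outcome:  Y_n | Z_n ~ Ber(theta_n)
```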

19.1.2.1 Applications

Item response models are used to model the way “underlying intelligence” \(Z\) relates to scores \(Y\) on IQ tests.

Item response models can also be used to model the way “suicidality” \(Z\) relates to answers on mental health surveys. Building a good model may help to infer when a patient is at psychiatric risk based on in-take surveys at points of care throughout the health-care system.

19.1.3 Example: Factor Analysis Models

In factor analysis models, we posit that the observed data \(Y\) with many measurements is generated by a small set of unobserved factors \(Z\):

\[\begin{aligned} Z_n &\sim \mathcal{N}(0, I),\\ Y_n|Z_n &\sim \mathcal{N}(\mu + \Lambda Z_n, \Phi), \end{aligned}\]

where \(n=1, \ldots, N\), \(Z_n\in \mathbb{R}^{D'}\) and \(Y_n\in \mathbb{R}^{D}\). We typically assume that \(D'\) is much smaller than \(D\).
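A minimal simulation sketch (the dimensions, the loading matrix \(\Lambda\), and the diagonal noise covariance \(\Phi\) below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Dp = 500, 10, 2                    # D' = 2 latent factors, D = 10 measurements
mu = np.zeros(D)                         # measurement means
Lam = rng.normal(size=(D, Dp))           # loading matrix Lambda
Phi = np.diag(rng.uniform(0.1, 0.5, D))  # diagonal noise covariance Phi

Z = rng.normal(size=(N, Dp))                                    # Z_n ~ N(0, I)
Y = mu + Z @ Lam.T + rng.multivariate_normal(np.zeros(D), Phi, size=N)
# each row of Y is one observation: Y_n | Z_n ~ N(mu + Lambda Z_n, Phi)
```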

19.1.3.1 Applications

Factor analysis models are useful for biomedical data, where we typically measure a large number of characteristics of a patient (e.g. blood pressure, heart rate, etc), but these characteristics are all generated by a small list of health factors (e.g. diabetes, cancer, hypertension etc). Building a good model means we may be able to infer the list of health factors of a patient from their observed measurements.


19.2 Maximum Likelihood Estimation for Latent Variable Models: Expectation Maximization

Given a latent variable model \(p(Y, Z| \phi, \theta) = p(Y | Z, \phi) p(Z|\theta)\), we are interested in computing the MLE of the parameters \(\phi\) and \(\theta\):

\[\begin{aligned} \theta_{\text{MLE}}, \phi_{\text{MLE}} &= \underset{\theta, \phi}{\mathrm{argmax}}\; \ell(\theta, \phi)\\ &= \underset{\theta, \phi}{\mathrm{argmax}}\; \log \prod_{n=1}^N \int_{\Omega_Z} p(y_n, z_n | \theta, \phi) dz_n\\ &= \underset{\theta, \phi}{\mathrm{argmax}}\; \log \prod_{n=1}^N \int_{\Omega_Z} p(y_n| z_n, \phi)p(z_n| \theta) dz_n \end{aligned}\]

where \(\Omega_Z\) is the domain of \(Z\). Why is this a hard optimization problem?

There are two major problems:

  1. the product in the integrand (the log cannot reach it);
  2. gradients cannot be pushed past the integral (i.e. we cannot easily compute the gradient to solve the optimization problem).

We solve these two problems by:

  1. pushing the log past the integral so that it can be applied to the integrand (Jensen’s Inequality);
  2. introducing an auxiliary distribution \(q(Z_n)\) to allow the gradient to be pushed past the integral.

\[\begin{aligned} \underset{\theta, \phi}{\mathrm{max}}\; \ell(\theta, \phi) &= \underset{\theta, \phi, q}{\mathrm{max}}\; \log \prod_{n=1}^N\int_{\Omega_Z} \left(\frac{p(y_n, z_n|\theta, \phi)}{q(z_n)}q(z_n)\right) dz_n\\ &= \underset{\theta, \phi, q}{\mathrm{max}}\; \log\,\prod_{n=1}^N\mathbb{E}_{Z_n\sim q(Z)} \left[ \frac{p(y_n, Z_n|\theta, \phi)}{q(Z_n)}\right]\\ &= \underset{\theta, \phi, q}{\mathrm{max}}\; \sum_{n=1}^N \log \mathbb{E}_{Z_n\sim q(Z)} \left[ \frac{p(y_n, Z_n|\theta, \phi)}{q(Z_n)}\right]\\ &\geq \underset{\theta, \phi, q}{\mathrm{max}}\; \underbrace{\sum_{n=1}^N\mathbb{E}_{Z_n\sim q(Z)} \left[ \log\,\left(\frac{p(y_n, Z_n|\theta, \phi)}{q(Z_n)}\right)\right]}_{ELBO(\theta, \phi, q)}, \quad (\text{Jensen's Inequality})\\ \end{aligned}\]

We call \(\sum_{n=1}^N\mathbb{E}_{Z_n\sim q(Z)} \left[ \log\,\left(\frac{p(y_n, Z_n|\theta, \phi)}{q(Z_n)}\right)\right]\) the Evidence Lower Bound (ELBO). Note that maximizing the ELBO will yield a lower bound of the maximum value of the log likelihood. Although the optimal point of the ELBO may not be the optimal point of the log likelihood, we nonetheless prefer to optimize the ELBO because the gradients, with respect to \(\theta, \phi\), of the ELBO are easier to compute:

\[\begin{aligned} \nabla_{\theta, \phi} ELBO(\theta, \phi) &= \nabla_{\theta, \phi}\left[ \sum_{n=1}^N\mathbb{E}_{Z_n\sim q(Z)} \left[ \log\,\left(\frac{p(y_n, Z_n|\theta, \phi)}{q(Z_n)}\right)\right]\right] \\ &= \sum_{n=1}^N\mathbb{E}_{Z_n\sim q(Z)} \left[ \nabla_{\theta, \phi} \left( \log\,\left(\frac{p(y_n, Z_n|\theta, \phi)}{q(Z_n)}\right)\right)\right] \end{aligned}\]

Note that we can push the gradient \(\nabla_{\theta, \phi}\) past the expectation \(\mathbb{E}_{Z_n\sim q(Z)}\) since the expectation is not computed with respect to our optimization variables!

Rather than optimizing the ELBO over all variables \(\theta, \phi, q\) (this would be hard), we optimize one set of variables at a time:

19.2.0.1 Step I: the M-step

Optimize the ELBO with respect to \(\theta, \phi\):

\[\begin{aligned} \theta^*, \phi^* = \underset{\theta, \phi}{\mathrm{argmax}}\; ELBO(\theta, \phi, q) &= \underset{\theta, \phi}{\mathrm{argmax}}\; \sum_{n=1}^N\mathbb{E}_{Z_n\sim q(Z)} \left[ \log\,\left(\frac{p(y_n, Z_n|\theta, \phi)}{q(Z_n)}\right)\right]\\ &= \underset{\theta, \phi}{\mathrm{argmax}}\; \sum_{n=1}^N \int_{\Omega_Z} \log\,\left(\frac{p(y_n, z_n|\theta, \phi)}{q(z_n)}\right)q(z_n) dz_n\\ &= \underset{\theta, \phi}{\mathrm{argmax}}\; \sum_{n=1}^N \left[\int_{\Omega_Z} \log\,\left(p(y_n, z_n|\theta, \phi)\right) q(z_n)dz_n - \underbrace{\int_{\Omega_Z} \log \left(q(z_n)\right)q(z_n) dz_n}_{\text{constant with respect to }\theta, \phi}\right]\\ &\equiv \underset{\theta, \phi}{\mathrm{argmax}}\;\sum_{n=1}^N \int_{\Omega_Z} \log\,\left(p(y_n, z_n|\theta, \phi)\right) q(z_n)dz_n\\ &= \underset{\theta, \phi}{\mathrm{argmax}}\;\sum_{n=1}^N \mathbb{E}_{Z_n\sim q(Z)} \left[ \log\left(p(y_n, Z_n|\theta, \phi)\right)\right] \end{aligned}\]

19.2.0.2 Step II: the E-step

Optimize the ELBO with respect to \(q\):

\[\begin{aligned} q^*(Z_n) = \underset{q}{\mathrm{argmax}}\; ELBO(\theta^*, \phi^*, q), \end{aligned}\] where \(\theta^*, \phi^*\) are the parameters obtained in the M-step.

Rather than optimizing the ELBO with respect to \(q\), which seems hard, we will argue that optimizing the ELBO is equivalent to optimizing another function of \(q\), one whose optimum is easy for us to compute.

Note: We can recognize the difference between the log likelihood and the ELBO as a function we’ve seen:

\[\begin{aligned} \ell(\theta, \phi) - ELBO(\theta, \phi, q) &= \sum_{n=1}^N \log p(y_n| \theta, \phi) - \sum_{n=1}^N \int_{\Omega_Z} \log\left(\frac{p(y_n, z_n|\theta, \phi)}{q(z_n)}\right)q(z_n) dz_n\\ &= \sum_{n=1}^N \int_{\Omega_Z} \log\left(p(y_n| \theta, \phi)\right) q(z_n) dz_n - \sum_{n=1}^N \int_{\Omega_Z} \log\left(\frac{p(y_n, z_n|\theta, \phi)}{q(z_n)}\right)q(z_n) dz_n\\ &= \sum_{n=1}^N \int_{\Omega_Z} \left(\log\left(p(y_n| \theta, \phi)\right) - \log\left(\frac{p(y_n, z_n|\theta, \phi)}{q(z_n)}\right) \right)q(z_n) dz_n\\ &= \sum_{n=1}^N \int_{\Omega_Z} \log\left(\frac{p(y_n| \theta, \phi)q(z_n)}{p(y_n, z_n|\theta, \phi)} \right)q(z_n) dz_n\\ &= \sum_{n=1}^N \int_{\Omega_Z} \log\left(\frac{q(z_n)}{p(z_n| y_n, \theta, \phi)} \right)q(z_n) dz_n, \\ &\quad\left(\text{Bayes' Rule: } \frac{p(y_n, z_n|\theta, \phi)}{p(y_n| \theta, \phi)} = p(z_n| y_n, \theta, \phi)\right)\\ &= \sum_{n=1}^N D_{\text{KL}} \left[ q(Z_n) \| p(Z_n| Y_n, \theta, \phi)\right]. \end{aligned}\]

Since \(\ell(\theta, \phi)\) is constant with respect to \(q\), the difference \[\sum_{n=1}^N D_{\text{KL}} \left[ q(Z_n) \| p(Z_n| Y_n, \theta, \phi)\right] = \ell(\theta, \phi) - ELBO(\theta, \phi, q)\] decreases when \(ELBO(\theta, \phi, q)\) increases (and vice versa). Thus, maximizing the ELBO is equivalent to minimizing \(D_{\text{KL}} \left[ q(Z_n) \| p(Z_n| Y_n, \theta, \phi)\right]\):

\[ \underset{q}{\mathrm{argmax}}\, ELBO(\theta, \phi, q) = \underset{q}{\mathrm{argmin}}\sum_{n=1}^N D_{\text{KL}} \left[ q(Z_n) \| p(Z_n| Y_n, \theta, \phi)\right]. \]

Thus, we see that \[\begin{aligned} q^*(Z_n) &= \underset{q}{\mathrm{argmax}}\; ELBO(\theta^*, \phi^*, q) \\ &= \underset{q}{\mathrm{argmin}}\sum_{n=1}^N D_{\text{KL}} \left[ q(Z_n) \| p(Z_n| Y_n, \theta^*, \phi^*)\right] = p(Z_n| Y_n, \theta^*, \phi^*) \end{aligned}\]

That is, we should set the optimal distribution \(q^*\) to be the posterior \(p(Z_n| Y_n, \theta^*, \phi^*)\).
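Because \(Z_n\) is discrete in a GMM, we can check this claim exactly on a tiny example: the ELBO sits strictly below the log-likelihood for a generic \(q\), and meets it when \(q\) is the posterior (a minimal sketch; all names are our own):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

# One observation y from a two-component 1-D mixture; Z is discrete, so all
# integrals over q become sums and we can evaluate everything exactly.
y, pi, mu, sigma = 1.0, np.array([0.4, 0.6]), np.array([-1.0, 2.0]), 1.0
log_joint = np.log(pi) + norm.logpdf(y, mu, sigma)   # log p(y, Z = k)
log_lik = logsumexp(log_joint)                       # log p(y)

def elbo(q):
    # E_q[log p(y, Z) - log q(Z)] for a distribution q over the two components
    return np.sum(q * (log_joint - np.log(q)))

posterior = np.exp(log_joint - log_lik)              # p(Z | y)
print(log_lik, elbo(np.array([0.5, 0.5])))  # ELBO is a strict lower bound here
print(log_lik, elbo(posterior))             # equal: the bound is tight at the posterior
```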

19.2.0.3 Iteration

Of course, we know that optimizing a function with respect to each variable is not sufficient for finding the global optimum over all the variables, considered together! Thus, performing one E-step and one M-step is not enough to maximize the ELBO. We need to repeat the two steps over and over.

(OPTIONAL!!!!) Why don’t gradients commute with expectation?

We have the following property of expectations:

\[ \nabla_z \mathbb{E}_{x\sim p(x)}[f(x, z)] = \mathbb{E}_{x\sim p(x)}[ \nabla_z f(x, z)] \]

That is, when the gradient is with respect to a variable that does not appear in the distribution with respect to which you are taking the expectation, then you can push the gradient past the expectation.

The intuition: the gradient with respect to \(z\) is computing the changes in a function by making infinitesimally small changes to \(z\); the expectation is computing the average value of a function by sampling \(x\) from a distribution that does not depend on \(z\). The two operations act independently on two different variables and hence can be done in any order.

Why can’t you do this in general? I.e. why is it that,

\[ \nabla_z\mathbb{E}_{x\sim p(x|z)}[f(x, z)] \neq \mathbb{E}_{x\sim p(x|z)}[ \nabla_z f(x, z)]?\]

The intuition: the gradient with respect to \(z\) is computing the changes in a function by making infinitesimally small changes to \(z\), which in turn affects the samples produced by \(p(x\vert z)\); these samples finally affect the output of \(f\). This is a chain of effects, and the order matters.

The formal proof: Consider the following case,

\[ p(x\vert z) = (z+1)x^z,\; x\in [0, 1] \]

and

\[ f(x, z) = xz. \]

Then, we have

\[\begin{aligned} \nabla_z \mathbb{E}_{x\sim p(x|z)} [f(x, z)] &= \nabla_z \int_0^1 f(x, z) p(x|z) dx \\ &= \nabla_z\int_0^1 xz \cdot (z+1)x^z dx \\ &= \nabla_z z (z+1)\int_0^1x^{z+1} dx \\ &= \nabla_z \frac{z (z+1)}{z+2} [x^{z+2} ]_0^1 \\ &= \nabla_z \frac{z (z+1)}{z+2} \\ &= \frac{z^2 + 4z + 2}{(z+2)^2} \end{aligned}\]

On the other hand, we have

\[ \mathbb{E}_{x\sim p(x|z)}\left[ \nabla_z f(x, z) \right] = \int_0^1 \nabla_z[ xz] (z+1)x^zdx = \int_0^1(z+1)x^{z+1}dx = \frac{z+1}{z+2} [x^{z+2}]_0^1 = \frac{z+1}{z+2}. \]

Note that:

\[ \nabla_z \mathbb{E}_{x\sim p(x|z)} [f(x, z)] = \frac{z^2 + 4z+ 2}{(z+2)^2} \neq \frac{z+1}{z+2} = \mathbb{E}_{x\sim p(x|z)}\left[ \nabla_z f(x, z) \right]. \]
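We can confirm this gap numerically, comparing a finite-difference gradient of the expectation against the expectation of the gradient (a sketch using scipy quadrature):

```python
import numpy as np
from scipy import integrate

# p(x|z) = (z+1) x^z on [0, 1]   and   f(x, z) = x z
def expectation(z):
    # E_{x ~ p(x|z)}[f(x, z)], computed by quadrature
    val, _ = integrate.quad(lambda x: (x * z) * (z + 1) * x**z, 0.0, 1.0)
    return val

z, eps = 1.5, 1e-5

# Left side: gradient of the expectation, via central finite differences
lhs = (expectation(z + eps) - expectation(z - eps)) / (2 * eps)

# Right side: expectation of the gradient, with grad_z f(x, z) = x
rhs, _ = integrate.quad(lambda x: x * (z + 1) * x**z, 0.0, 1.0)

print(lhs, (z**2 + 4*z + 2) / (z + 2)**2)  # ~0.837: matches the first closed form
print(rhs, (z + 1) / (z + 2))              # ~0.714: a genuinely different number
```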

(Optional!!!) Why do we need to maximize the ELBO with respect to q?

Recall that in the derivation of the ELBO, we first introduced an auxiliary distribution \(q\) to rewrite the observed log-likelihood:

\[ \log p(y|\theta, \phi) = \log \int_{\Omega_Z} p(y, z| \theta, \phi) dz = \log \int_{\Omega_Z} \frac{p(y, z| \theta, \phi)}{q(z)}q(z) dz = \log \mathbb{E}_{q(z)} \left[ \frac{p(y, z|\theta, \phi)}{q(z)} \right] \]

Again, the reason why we do this is that, when we eventually take the gradient with respect to \(\theta, \phi\) during optimization, we can use the identity

\[ \nabla_{\theta, \phi} \mathbb{E}_{q(z)}\left[\frac{p(y, z|\theta, \phi)}{q(z)}\right] = \mathbb{E}_{q(z)}\left[\nabla_{\theta, \phi} \frac{p(y, z|\theta, \phi)}{q(z)}\right] \]

At this point, there is no need to maximize over q, that is:

\[ \max_{\theta, \phi, q}\log \mathbb{E}_{q(z)}\left[\frac{p(y, z|\theta, \phi)}{q(z)}\right] = \max_{\theta, \phi}\log \mathbb{E}_{q(z)}\left[\frac{p(y, z|\theta, \phi)}{q(z)}\right] \]

The \(q\) cancels and has no effect on the outcome or process of the optimization (but you can’t just choose any \(q\) you want; can you see what the constraints on \(q\) are?).

Now, the problem is that the log is on the outside of the expectation. This isn’t a problem in the sense that we don’t know how to take the derivative of a logarithm of a complex function (this is just the chain rule); the problem is that

\[ \nabla_{\phi, \theta} \frac{p(y, z|\theta, \phi)}{q(z)} \]

can be very complex (since \(p\) and \(q\) are pdfs), and so, overall, the gradient of the log expectation is not something whose roots you can compute in closed form. Here is where we push the log inside the expectation using Jensen’s inequality:

\[ \log \mathbb{E}_{q(z)}\left[\frac{p(y, z|\theta, \phi)}{q(z)}\right] \geq \mathbb{E}_{q(z)}\left[\log \left(\frac{p(y, z|\theta, \phi)}{q(z)}\right)\right] \overset{\text{def}}{=} ELBO(\phi, \theta, q) \]

When we push the log inside the expectation, we obtain the Evidence Lower Bound (ELBO).

Now, for any choice of \(q\), we always have:

\[ \max_{\theta, \phi}\log \mathbb{E}_{q(z)}\left[\frac{p(y, z|\theta, \phi)}{q(z)}\right] \geq \max_{\theta, \phi}ELBO(\phi, \theta, q) \]

But the ELBO is not necessarily a tight bound (i.e. maximizing the ELBO can leave us very far from maximizing the log-likelihood!). In particular, some choices of \(q\) might give you a tighter bound on the log-likelihood than others. Thus, we want to select the \(q\) that gives us the tightest bound:

\[ \max_{\theta, \phi}\log \mathbb{E}_{q(z)}\left[\frac{p(y, z|\theta, \phi)}{q(z)}\right] \geq \max_{\theta, \phi, q}ELBO(\phi, \theta, q). \]

19.3 The Expectation Maximization Algorithm

The expectation maximization (EM) algorithm maximizes the ELBO of the model:

  1. Initialization: Pick \(\theta_0\), \(\phi_0\).
  2. Repeat \(i=1, \ldots, I\) times:

E-Step: \[q_{\text{new}}(Z_n) = \underset{q}{\mathrm{argmax}}\; ELBO(\theta_{\text{old}}, \phi_{\text{old}}, q) = p(Z_n|Y_n, \theta_{\text{old}}, \phi_{\text{old}})\]

M-Step: \[\begin{aligned} \theta_{\text{new}}, \phi_{\text{new}} &= \underset{\theta, \phi}{\mathrm{argmax}}\; ELBO(\theta, \phi, q_{\text{new}})\\ &= \underset{\theta, \phi}{\mathrm{argmax}}\; \sum_{n=1}^N\mathbb{E}_{Z_n\sim p(Z_n|Y_n, \theta_{\text{old}}, \phi_{\text{old}})}\left[\log p(y_n, Z_n | \phi, \theta) \right]. \end{aligned}\]
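For the GMM of Section 19.1.1, both steps are available in closed form, so the whole algorithm fits in a short function. The sketch below is our own minimal implementation (the name `gmm_em`, the initialization scheme, and the small ridge added to the covariances for numerical stability are all illustrative choices, not a canonical recipe):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def gmm_em(y, K, n_iter=100, seed=0):
    """A sketch of EM for a K-component GMM; y has shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = y.shape
    # Initialization: uniform weights, K random data points as means,
    # and the overall data covariance for every component.
    pi = np.full(K, 1.0 / K)
    mu = y[rng.choice(N, size=K, replace=False)].copy()
    Sigma = np.stack([np.cov(y, rowvar=False).reshape(D, D) + 1e-6 * np.eye(D)] * K)

    for _ in range(n_iter):
        # E-step: q(Z_n = k) = p(Z_n = k | y_n, theta_old, phi_old),
        # the "responsibility" of component k for point n.
        log_r = np.stack(
            [np.log(pi[k]) + mvn.logpdf(y, mu[k], Sigma[k]) for k in range(K)],
            axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)  # stabilize before exponentiating
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: closed-form argmax of sum_n E_q[log p(y_n, Z_n | theta, phi)]
        Nk = r.sum(axis=0)                         # effective counts per component
        pi = Nk / N
        mu = (r.T @ y) / Nk[:, None]
        for k in range(K):
            d = y - mu[k]
            Sigma[k] = (r[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma

# A quick check on synthetic 1-D data from two well-separated Gaussians
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 700)])[:, None]
pi, mu, Sigma = gmm_em(y, K=2)
print(pi, mu.ravel())   # roughly [0.3, 0.7] and [-2, 3], up to label permutation
```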

(Optional!!!) Another View of Expectation Maximization: The Auxiliary Function

We often denote the expectation in the M-step by \(Q\left(\theta, \phi| \theta^{\text{old}}, \phi^{\text{old}}\right)\) \[ Q\left(\theta, \phi| \theta^{\text{old}}, \phi^{\text{old}}\right) = \sum_{n=1}^N\mathbb{E}_{Z_n\sim p(Z_n|Y_n, \theta_{\text{old}}, \phi_{\text{old}})}\left[\log p(y_n, Z_n | \phi, \theta) \right] \] and call \(Q\) the auxiliary function.

Frequently, the EM algorithm is equivalently presented as:

  - E-step: compute the auxiliary function \(Q\left(\theta, \phi| \theta^{\text{old}}, \phi^{\text{old}}\right)\).
  - M-step: maximize the auxiliary function: \(\theta^{\text{new}}, \phi^{\text{new}} = \underset{\theta, \phi}{\mathrm{argmax}}\,Q\left(\theta, \phi| \theta^{\text{old}}, \phi^{\text{old}}\right)\).

The log of the joint distribution \(\prod_{n=1}^N p(Z_n, Y_n| \theta, \phi)\) is called the complete data log-likelihood (since it is the likelihood of both observed and latent variables), whereas \(\log \prod_{n=1}^N p(Y_n| \theta, \phi)\) is called the observed data log-likelihood (since it is the likelihood of only the observed variables).

The auxiliary function presentation of EM is easy to interpret:

  - In the E-step, you fill in the latent variables in the complete data log-likelihood using “average” values; this leaves just an estimate of the observed log-likelihood.
  - In the M-step, you find parameters \(\phi\) and \(\theta\) that maximize your estimate of the observed log-likelihood.

We chose to derive EM via the ELBO in this lecture because it makes an explicit connection between the EM algorithm for computing the MLE and variational inference methods for approximating the posterior of Bayesian models. It is, however, worthwhile to derive EM using the auxiliary function \(Q\), as \(Q\) makes it convenient for us to prove properties of the EM algorithm.

19.4 Monotonicity and Convergence of EM

Before we run off estimating MLE parameters of latent variable models with EM, we need to sanity check two points:

  1. (Monotonicity) we need to know that repeating the E, M-steps will never decrease the ELBO!
  2. (Convergence) we need to know that at some point the EM algorithm will naturally terminate (the algorithm will cease to update the parameters).

(Optional!!!) Proof of Properties of EM

We first prove the monotonicity of EM. Consider the difference \(\ell(\theta, \phi) - \ell(\theta^{\text{old}}, \phi^{\text{old}})\), i.e. the amount by which the log-likelihood increases or decreases in going from \(\theta^{\text{old}}, \phi^{\text{old}}\) to \(\theta, \phi\):

\[\begin{aligned} \ell(\theta, \phi) - \ell(\theta^{\text{old}}, \phi^{\text{old}}) &= \sum_{n=1}^N\log \left[ \frac{p(y_n|\theta, \phi)}{p(y_n| \theta^{\text{old}}, \phi^{\text{old}})}\right]\\ &= \sum_{n=1}^N \log\int \frac{p(y_n, z_n|\theta, \phi)}{p(y_n| \theta^{\text{old}}, \phi^{\text{old}})} dz_n\\ &= \sum_{n=1}^N \log\int \frac{p(y_n, z_n|\theta, \phi)}{p(y_n| \theta^{\text{old}}, \phi^{\text{old}}) p(z_n|y_n, \theta^{\text{old}}, \phi^{\text{old}})}p(z_n|y_n, \theta^{\text{old}}, \phi^{\text{old}}) dz_n\\ &= \sum_{n=1}^N \log\int \frac{p(y_n, z_n|\theta, \phi)}{p(y_n, z_n| \theta^{\text{old}}, \phi^{\text{old}})}p(z_n|y_n, \theta^{\text{old}}, \phi^{\text{old}}) dz_n\\ &= \sum_{n=1}^N \log \mathbb{E}_{p(z_n|y_n, \theta^{\text{old}}, \phi^{\text{old}})} \left[\frac{p(y_n, z_n|\theta, \phi)}{p(y_n, z_n| \theta^{\text{old}}, \phi^{\text{old}})}\right]\\ &\geq \sum_{n=1}^N \mathbb{E}_{p(z_n|y_n, \theta^{\text{old}}, \phi^{\text{old}})} \log\left[\frac{p(y_n, z_n|\theta, \phi)}{p(y_n, z_n| \theta^{\text{old}}, \phi^{\text{old}})}\right]\\ &= \sum_{n=1}^N \mathbb{E}_{p(z_n|y_n, \theta^{\text{old}}, \phi^{\text{old}})} \left[\log p(y_n, z_n|\theta, \phi) - \log p(y_n, z_n| \theta^{\text{old}}, \phi^{\text{old}})\right]\\ &= \sum_{n=1}^N \mathbb{E}_{p(z_n|y_n, \theta^{\text{old}}, \phi^{\text{old}})} \left[\log p(y_n, z_n|\theta, \phi)\right] - \sum_{n=1}^N \mathbb{E}_{p(z_n|y_n, \theta^{\text{old}}, \phi^{\text{old}})}\left[ \log p(y_n, z_n| \theta^{\text{old}}, \phi^{\text{old}})\right]\\ &= Q\left(\theta, \phi| \theta^{\text{old}}, \phi^{\text{old}}\right) - Q\left(\theta^{\text{old}}, \phi^{\text{old}}| \theta^{\text{old}}, \phi^{\text{old}}\right) \end{aligned}\]

Thus, when we maximize the gain in log-likelihood going from \(\theta^{\text{old}}, \phi^{\text{old}}\) to \(\theta, \phi\), we get:

\[\begin{aligned} \underset{\theta, \phi}{\max} \left[\ell(\theta, \phi) - \ell(\theta^{\text{old}}, \phi^{\text{old}})\right] \geq \underset{\theta, \phi}{\max} \left[Q\left(\theta, \phi| \theta^{\text{old}}, \phi^{\text{old}}\right) - Q\left(\theta^{\text{old}}, \phi^{\text{old}}| \theta^{\text{old}}, \phi^{\text{old}}\right)\right] \end{aligned}\]

or equivalently,

\[\begin{aligned} \underset{\theta, \phi}{\max} \left[\ell(\theta, \phi)\right] - \ell(\theta^{\text{old}}, \phi^{\text{old}}) \geq \underset{\theta, \phi}{\max} \left[Q\left(\theta, \phi| \theta^{\text{old}}, \phi^{\text{old}}\right)\right] - Q\left(\theta^{\text{old}}, \phi^{\text{old}}| \theta^{\text{old}}, \phi^{\text{old}}\right). \end{aligned}\]

Note that the above max is always greater than or equal to zero:

\[\underset{\theta, \phi}{\max} \left[Q\left(\theta, \phi| \theta^{\text{old}}, \phi^{\text{old}}\right)\right] - Q\left(\theta^{\text{old}}, \phi^{\text{old}}| \theta^{\text{old}}, \phi^{\text{old}}\right) \geq 0\]

since we can always maintain the status quo by choosing \(\theta = \theta^{\text{old}}\) and \(\phi = \phi^{\text{old}}\):

\[ Q\left(\theta^{\text{old}}, \phi^{\text{old}}| \theta^{\text{old}}, \phi^{\text{old}}\right) - Q\left(\theta^{\text{old}}, \phi^{\text{old}}| \theta^{\text{old}}, \phi^{\text{old}}\right) = 0.\]

Thus, we have that by maximizing \(Q\left(\theta, \phi| \theta^{\text{old}}, \phi^{\text{old}}\right)\), we ensure that \(\ell(\theta, \phi) - \ell(\theta^{\text{old}}, \phi^{\text{old}})\geq 0\) in each iteration of EM.

If the likelihood of the model is bounded above (i.e. \(\ell(\theta, \phi) \leq M\) for some constant \(M\)), then EM is guaranteed to converge. This is because we’ve proved that EM increases (or maintains) the log-likelihood in each iteration; a monotonically non-decreasing sequence that is bounded above must converge.
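We can watch this monotonicity empirically, reusing the `gmm_em` sketch from Section 19.3 (assumed to be in scope) together with a helper of our own, `gmm_loglik`. Since `gmm_em` is deterministic given its seed, running it for \(i = 1, 2, \ldots\) iterations traces out successive points of a single EM run:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn
from scipy.special import logsumexp

def gmm_loglik(y, pi, mu, Sigma):
    # Observed-data log-likelihood: sum_n log sum_k pi_k N(y_n | mu_k, Sigma_k)
    log_p = np.stack([np.log(pi[k]) + mvn.logpdf(y, mu[k], Sigma[k])
                      for k in range(len(pi))], axis=1)
    return logsumexp(log_p, axis=1).sum()

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 700)])[:, None]
# Same seed => same initialization, so i iterations extends the same EM run.
lls = [gmm_loglik(y, *gmm_em(y, K=2, n_iter=i)) for i in range(1, 15)]
print(np.all(np.diff(lls) > -1e-6))  # True: never (measurably) decreases
```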

Disclaimer:

Although EM converges for bounded likelihoods, it is not guaranteed to converge to the global max of the log-likelihood! By formalizing an analogy between EM and gradient descent, you can show that EM converges to a stationary point of the likelihood. Oftentimes, EM converges only to a local optimum of the likelihood function, and the point to which it converges may be very sensitive to initialization.