Chapter 20 Review of Latent Variables, Compression and Clustering

20.0.1 Example: Gaussian Mixture Models (GMMs)

In a Gaussian Mixture Model (GMM), we posit that the observed data \(Y\) is generated by a mixture of \(K\) Gaussians with mixture weights \(\pi=[\pi_1, \ldots, \pi_K]\), means \(\mu = [\mu_1, \ldots, \mu_K]\), and covariances \(\Sigma = [\Sigma_1, \ldots, \Sigma_K]\). For each observation \(Y_n\), the class label \(Z_n\) is a latent variable indicating which of the \(K\) Gaussians is responsible for generating \(Y_n\):

\[\begin{aligned} Z_n &\sim \mathrm{Cat}(\pi),\\ Y_n \mid Z_n &\sim \mathcal{N}(\mu_{Z_n}, \Sigma_{Z_n}), \end{aligned}\]

where \(n=1, \ldots, N\) and \(\sum_{k=1}^K \pi_k = 1\).
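As a minimal sketch of this generative process (the mixture weights, means, and covariances below are hypothetical values for \(K=2\) components in two dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture parameters for K = 2 components in 2 dimensions.
pi = np.array([0.3, 0.7])                       # mixing weights, sum to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0]])         # component means
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])  # component covariances

N = 500
Z = rng.choice(len(pi), size=N, p=pi)           # Z_n ~ Cat(pi)
Y = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in Z])  # Y_n | Z_n
```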

GMMs are an example of model-based clustering: partitioning a data set into natural clusters based on a statistical model fitted to the data.

20.0.2 Example: Item-Response Models

In item-response models, we measure a real-valued unobserved trait \(Z\) of a subject by performing a series of experiments with binary observable outcomes \(Y\):

\[\begin{aligned} Z_n &\sim \mathcal{N}(\mu, \sigma^2),\\ \theta_n &= g(Z_n),\\ Y_n \mid Z_n &\sim \mathrm{Ber}(\theta_n), \end{aligned}\]

where \(n=1, \ldots, N\) and \(g\) is a fixed function mapping \(Z_n\) to a valid probability in \([0, 1]\) (e.g. the sigmoid).
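As a concrete sketch, taking \(g\) to be the sigmoid so that \(\theta_n\) is a valid probability, and using arbitrary hypothetical values for \(\mu\) and \(\sigma\):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical prior over the latent trait; g is taken to be the sigmoid here.
mu, sigma = 0.0, 1.0
N = 1000

Z = rng.normal(mu, sigma, size=N)   # Z_n ~ N(mu, sigma^2)
theta = sigmoid(Z)                  # theta_n = g(Z_n), a probability in (0, 1)
Y = rng.binomial(1, theta)          # Y_n | Z_n ~ Ber(theta_n)
```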

20.0.2.1 Applications

Item response models are used to model the way “underlying intelligence” \(Z\) relates to scores \(Y\) on IQ tests.

Item response models can also be used to model the way “suicidality” \(Z\) relates to answers on mental health surveys. Building a good model may help us infer when a patient is at psychiatric risk based on intake surveys at points of care throughout the health-care system.

20.0.3 Example: Factor Analysis Models

In factor analysis models, we posit that the observed data \(Y\), consisting of many measurements, is generated by a small set of unobserved factors \(Z\):

\[\begin{aligned} Z_n &\sim \mathcal{N}(0, I),\\ Y_n|Z_n &\sim \mathcal{N}(\mu + \Lambda Z_n, \Phi), \end{aligned}\]

where \(n=1, \ldots, N\), \(Z_n\in \mathbb{R}^{D'}\) and \(Y_n\in \mathbb{R}^{D}\). We typically assume that \(D'\) is much smaller than \(D\).
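A small sketch of the factor-analysis generative process, with hypothetical dimensions \(D'=2\), \(D=10\) and a diagonal \(\Phi\) (a common modeling choice, not something fixed by the formulas above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and parameters: D' = 2 latent factors, D = 10 measurements.
D_latent, D, N = 2, 10, 500
mu = np.zeros(D)
Lambda = rng.normal(size=(D, D_latent))       # factor loading matrix
Phi = np.diag(rng.uniform(0.1, 0.5, size=D))  # diagonal noise covariance

Z = rng.normal(size=(N, D_latent))                         # Z_n ~ N(0, I)
noise = rng.multivariate_normal(np.zeros(D), Phi, size=N)
Y = mu + Z @ Lambda.T + noise                              # Y_n | Z_n ~ N(mu + Lambda Z_n, Phi)
```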

20.0.3.1 Applications

Factor analysis models are useful for biomedical data, where we typically measure a large number of characteristics of a patient (e.g. blood pressure, heart rate, etc.), but these characteristics are all generated by a small set of underlying health factors (e.g. diabetes, cancer, hypertension, etc.). Building a good model means we may be able to infer a patient's underlying health factors from their observed measurements.

20.1 PCA Versus Probabilistic PCA (pPCA)

Suppose that our data \(\mathbf{X}\) is centered (if the data is not centered, then reasoning about PCA as a projection that minimizes reconstruction loss requires messier formulae). Let’s also assume that \(z_n\) is one-dimensional and \(\mathbf{x}_n\) is two-dimensional.

| | PCA | pPCA |
|---|---|---|
| Model | \(z_n = v^\top \mathbf{x}_n \in \mathbb{R}\) (projection); \(\hat{\mathbf{x}}_n = v z_n \in \mathbb{R}^2\) (reconstruction) | \(z_n \sim \mathcal{N}(0, 1)\); \(\mathbf{x}_n \mid z_n \sim \mathcal{N}(v z_n, \sigma^2 I)\) |
| Goal | Minimize reconstruction loss | Maximize observed-data log-likelihood |
| Training Objective | \(\min_v \sum_{n=1}^N \lVert \mathbf{x}_n - v z_n\rVert^2_2\), subject to \(\lVert v\rVert_2 = 1\) | \(\max_{v, q(z_n)}\; \sum_{n=1}^N \mathrm{ELBO}(v, q(z_n))\) (optional: \(\lVert v\rVert_2 = 1\)) |
| Good For | Compression: \(\mathbf{x}_n \mapsto z_n\) | Modeling the data distribution: \(\mathbf{x}_n \sim p(\mathbf{x}_n \mid v)\); generating new data: \(z^\mathrm{new}_n \sim \mathcal{N}(0, 1)\), \(\mathbf{x}^\mathrm{new}_n \mid z^\mathrm{new}_n \sim \mathcal{N}(v z^\mathrm{new}_n, \sigma^2 I)\) |
| How to “Compress” | Compute \(z_n = v^\top \mathbf{x}_n\) | Infer the posterior \(p(z_n \mid \mathbf{x}_n, v)\) |
| Training Algorithm | SVD (eigendecomposition of the data covariance) | Expectation Maximization |
| Deep Model Analogue | Autoencoder | Variational Autoencoder |
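As a minimal sketch of the PCA column of this table, using the SVD of (synthetic) centered data to find the principal direction, compress, and reconstruct:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X = X - X.mean(axis=0)              # center the data, as assumed above

# PCA: the first right singular vector is the direction minimizing reconstruction loss.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
v = Vt[0]                           # unit-norm principal direction, ||v||_2 = 1

z = X @ v                           # compression: z_n = v^T x_n
X_hat = np.outer(z, v)              # reconstruction: x_hat_n = v z_n
recon_loss = np.sum((X - X_hat) ** 2)
```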

20.1.1 What to Know About Expectation Maximization

20.1.1.1 Motivation: Why Can’t We Just Autodiff the Log-likelihood and Run Gradient Descent?

  1. Can’t directly maximize the conditional likelihood: \[\max_v \sum_{n=1}^N\log p(\mathbf{x}_n | z_n, v),\] because the maximizer \(v^*\) will be expressed in terms of the unobserved variables \(z_n\), which makes \(v^*\) useless.

  2. Can’t directly maximize the observed-data log-likelihood: \[\max_v \sum_{n=1}^N\log p(\mathbf{x}_n | v) = \max_v \sum_{n=1}^N\log \mathbb{E}_{p(z_n)} [p(\mathbf{x}_n | z_n, v)],\] because we either:

    • can’t compute \(\mathbb{E}_{p(z_n)} [p(\mathbf{x}_n | z_n, v)]\) in closed form (the integral is too hard), and/or we cannot push the gradient inside the expectation

    or alternatively

    • we can compute \(\mathbb{E}_{p(z_n)} [p(\mathbf{x}_n | z_n, v)]\) in closed form (the integral is easy, as in the GMM case sketched below), but gradient descent on \(\log p(\mathbf{x}_n | v)\) struggles to converge
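For a GMM, the second bullet applies: the expectation over \(z_n\) is a finite sum over the \(K\) components, so the observed-data log-likelihood is available in closed form. A sketch, assuming mixture parameter arrays pi, mu, Sigma like the hypothetical ones in the GMM example above:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_marginal(X, pi, mu, Sigma):
    """Observed-data log-likelihood sum_n log p(x_n | v) for a GMM, where the
    expectation over z_n is a finite sum computed in closed form."""
    # log pi_k + log N(x_n; mu_k, Sigma_k) for every n and k
    log_terms = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
         for k in range(len(pi))],
        axis=1,
    )
    return logsumexp(log_terms, axis=1).sum()  # log sum_k exp(...), summed over n
```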

20.1.1.2 Our Solution

  1. We optimize a lower bound on the observed-data log-likelihood that is also easier to work with: \[ \sum_{n=1}^{N}\mathrm{ELBO}(v, q(z_n)) = \sum_{n=1}^{N} \mathbb{E}_{q(z_n)}\left[ \log \frac{p(\mathbf{x}_n|z_n, v)p(z_n)}{q(z_n)}\right] \]
  2. The ELBO is a more tractable objective to work with because:
    • the expectation is with respect to a distribution we choose, \(q(z_n)\), and thus we can choose \(q(z_n)\) to not depend on \(v\). So the gradient \(\nabla_v\) can be pushed inside the expectation.
    • we can optimize the ELBO via coordinate-wise optimization, taking turns optimizing over each \(q\) and \(v\), while holding the other constant:
      • (E-Step) when optimizing the ELBO with respect to \(q\), a lot of algebra shows us that we just need to set \(q(z_n) = p(z_n| \mathbf{x}_n, v^\mathrm{old})\).

        Note: this is “easy” only if we can find the posterior \(p(z_n| \mathbf{x}_n, v^\mathrm{old})\) in closed form. When the posterior is not available in closed form, the E-step is typically performed approximately via variational inference.

      • (M-Step) when optimizing the ELBO with respect to \(v\), a lot of algebra shows us that the ELBO greatly simplifies: \[ v^\mathrm{new}=\mathrm{argmax}_v \sum_{n=1}^N \mathbb{E}_{p(z_n| \mathbf{x}_n, v^\mathrm{old})} \left[ \log p(\mathbf{x}_n|z_n, v) + \log p(z_n)\right] \] Note: this optimization is typically done analytically, but you can also optimize in the M-step using gradient descent. (A concrete EM sketch for a GMM follows below.)
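Below is a bare-bones sketch of these two steps for the GMM of Section 20.0.1, where the E-step posterior (the responsibilities) is available in closed form and the M-step updates are analytic. The function name, initialization, fixed iteration count, and the small ridge added to the covariances are illustrative choices, not part of these notes.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """A bare-bones EM sketch for a GMM (no convergence checks or regularization)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]      # initialize means at random data points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)

    for _ in range(n_iters):
        # E-step: q(z_n) = p(z_n | x_n, old parameters), i.e. the responsibilities.
        log_r = np.stack(
            [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
             for k in range(K)], axis=1)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)

        # M-step: maximize the expected complete-data log-likelihood analytically.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, r
```

In practice one would monitor the ELBO (or the observed-data log-likelihood) across iterations and stop when it plateaus, rather than running a fixed number of iterations.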

20.2 Non-Probabilistic Clustering Versus Probabilistic Clustering

| | Non-Probabilistic | Probabilistic (Mixture Models) |
|---|---|---|
| Model | Uses some notion of similarity to form clusters | \(z_n \sim \mathrm{Cat}(\pi)\); \(\mathbf{x}_n \mid z_n \sim \mathcal{N}(\mu_{z_n}, \Sigma_{z_n})\) |
| Goal | Maximize cluster “quality” | Maximize observed-data log-likelihood |
| Training Objective | Minimize some loss | \(\max_{\pi, \mu, \Sigma, q(z_n)}\; \sum_{n=1}^N \mathrm{ELBO}(\pi, \mu, \Sigma, q(z_n))\), with \(\sum_{k=1}^K \pi_k = 1\) |
| Good For | Producing hard cluster assignments | Producing soft cluster assignments (uncertainty); generating new data: (1) sample \(z_n = k\), (2) sample \(\mathbf{x}_n\) from the \(k\)-th Gaussian |
| Training Algorithm | Various | Expectation Maximization |
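The responsibilities r returned by the hypothetical em_gmm sketch above double as soft cluster assignments (taking r.argmax(axis=1) recovers hard assignments), and new data can be drawn from the fitted mixture exactly as the last “Good For” entry describes:

```python
import numpy as np

def sample_gmm(pi, mu, Sigma, n, seed=0):
    """Draw n new points from a fitted mixture: (1) sample z_n = k,
    (2) sample x_n from the k-th Gaussian."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)
    x = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return x, z
```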