Chapter 4 What are Probabilistic and Non-Probabilistic Regression?

4.1 What is Probabilistic Regression?

As we saw in Lecture #2, in regression, we can choose amongst a very large number of loss functions (i.e. functions that quantify the fit of our model). One way to justify our choice is to reason explicitly about how residuals (prediction errors) arise.

If we include in our model specification a theory of how residuals arise as a random variable, then we have a probabilistic regression model.

A probabilistic regression problem is the task of predicting an output value \(y \in \mathbb{R}\), given an input \(\mathbf{x} \in \mathbb{R}^D\), where the data generating process includes a source of random noise \(\epsilon\).

We have seen one way to solve a probabilistic regression problem, where the observation noise is additive:

  1. choose a training data set of \(N\) observations, \(\mathcal{D} = \{(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_N, y_N)\}\).
  2. hypothesize that the relationship between \(\mathbf{x}\) and \(y\) is captured by \[ f_\mathbf{w}(\mathbf{x}) \] where \(\mathbf{w} \in \mathbb{R}^M\) are the parameters (i.e. unknown coefficients) of the function \(f\).
  3. choose a random variable \(\epsilon\) that describes the additive observation noise: \[ y = f_\mathbf{w}(\mathbf{x}) + \epsilon,\; \epsilon \sim p(\epsilon | \theta) \]
  4. note that this choice defines a likelihood \(p(y| \mathbf{w}, \mathbf{x})\), describing the likelihood of observing \((y, \mathbf{x})\) given that the model is \(f_\mathbf{w}\).
  5. choose a mathematical notion of “how well \(f_\mathbf{w}\) fits the noisy data”, i.e. a loss function, \(\mathcal{L}(\mathbf{w})\).
  6. choose a way to solve for the parameters \(\mathbf{w}\) that minimize the loss function: \[ \mathbf{w}^* = \mathrm{argmin}_\mathbf{w}\; \mathcal{L}(\mathbf{w}) \]

This framework applies to probabilistic regression with any functional form for \(f\) (linear, polynomial, neural network etc).
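To make the recipe concrete, here is a minimal sketch (using NumPy, with synthetic data and parameter values invented purely for illustration) of the six steps with a linear \(f_\mathbf{w}(\mathbf{x}) = \mathbf{w}^\top\mathbf{x}\) and additive Gaussian noise:

```python
# A minimal sketch of the six-step recipe, assuming a linear f_w(x) = w^T x,
# additive zero-mean Gaussian noise, and synthetic data made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a training set D = {(x_n, y_n)} (synthetic here)
N, D = 100, 2
true_w = np.array([1.5, -0.7])
X = rng.normal(size=(N, D))

# Steps 2-3: hypothesize y = f_w(x) + eps with eps ~ N(0, sigma^2)
sigma = 0.3                                   # noise hyperparameter theta, fixed up front
y = X @ true_w + rng.normal(0.0, sigma, size=N)

# Steps 4-5: the induced likelihood is Gaussian; use its negative log as the loss L(w)
def loss(w):
    residuals = y - X @ w
    return (0.5 / sigma**2) * np.sum(residuals**2) + 0.5 * N * np.log(2 * np.pi * sigma**2)

# Step 6: solve w* = argmin_w L(w); for this linear-Gaussian model the minimizer
# has a closed form (the ordinary least squares solution; see Section 4.2)
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print("learned w:", w_star, " loss:", loss(w_star))
```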

Note: while \(\mathbf{w}\) is learnt, the parameters \(\theta\) of the noise distribution \(p(\epsilon | \theta)\) are typically set by the ML engineer prior to learning \(\mathbf{w}\). Parameters like \(\theta\) that must be chosen prior to learning are called hyperparameters.

Technically, \(y\) is conditioned on \(\mathbf{w}\), \(\mathbf{x}\) and \(\theta\) in the likelihood, \(p(y| \mathbf{w}, \mathbf{x}, \theta)\), but typically we drop the \(\theta\) and write \(p(y| \mathbf{w}, \mathbf{x})\) if we are not intending to infer it from the data.

4.2 (Almost) Everything is Linear Regression

Overwhelmingly often in machine learning, we assume additive zero-mean homoskedastic Gaussian noise, that is \[ y = f_\mathbf{w}(\mathbf{x}) + \epsilon,\; \epsilon \sim \mathcal{N}(0, \sigma^2) \] which implies a Gaussian likelihood: \[ y | \mathbf{w}, \mathbf{x} \sim \mathcal{N}(f_\mathbf{w}(\mathbf{x}), \sigma^2). \] We choose the negative joint log-likelihood of the data as the metric for how well our model fits the data \[ \mathcal{L}(\mathbf{w}) = -\sum_{n=1}^N \log p(y_n | \mathbf{w}, \mathbf{x}_n). \] Choosing this loss follows the Maximum Likelihood Principle, encoding our belief that models that render the observed data more likely are better models.
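Substituting the Gaussian density into this loss makes the connection to squared error explicit: up to terms that do not depend on \(\mathbf{w}\), the negative log-likelihood is a scaled sum of squared residuals, \[ \mathcal{L}(\mathbf{w}) = \frac{1}{2\sigma^2}\sum_{n=1}^N \big(y_n - f_\mathbf{w}(\mathbf{x}_n)\big)^2 + \frac{N}{2}\log\left(2\pi\sigma^2\right). \]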

Now, we learn the parameters \(\mathbf{w}\) of \(f\) by minimizing \(\mathcal{L}\) (equivalently, by maximizing the joint log-likelihood of the data): \[ \mathbf{w}^{\text{MLE}} = \mathrm{argmin}_{\mathbf{w}} \mathcal{L}(\mathbf{w}) \] The solution \(\mathbf{w}^{\text{MLE}}\) is called the maximum likelihood estimate of \(\mathbf{w}\).

We showed that under this set of assumptions and choices, probabilistic linear regression is equivalent to non-probabilistic linear regression (ordinary least squares): \[ \mathbf{w}^{\text{MLE}} = \mathbf{w}^{\text{OLS}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \] More generally, under Gaussian noise, maximizing the log-likelihood is the same as minimizing the MSE.
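A quick numerical sanity check of this equivalence, as a sketch on synthetic data (scipy's general-purpose optimizer stands in here for “any way of minimizing \(\mathcal{L}\)”; the data and dimensions are made up for illustration):

```python
# Check that minimizing the Gaussian negative log-likelihood over w
# recovers the ordinary least squares solution (synthetic data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, D = 200, 3
X = rng.normal(size=(N, D))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.4, size=N)
sigma = 0.4                                    # noise scale, treated as a fixed hyperparameter

def nll(w):
    r = y - X @ w
    return (0.5 / sigma**2) * np.sum(r**2) + 0.5 * N * np.log(2 * np.pi * sigma**2)

w_mle = minimize(nll, x0=np.zeros(D)).x        # numerical argmin of the loss
w_ols = np.linalg.solve(X.T @ X, X.T @ y)      # closed-form (X^T X)^{-1} X^T y
print(np.allclose(w_mle, w_ols, atol=1e-4))    # True: the two solutions coincide
```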

Question 1: If the solution to probabilistic regression is equivalent to non-probabilistic regression, does this mean that probabilistic regression models are equivalent to non-probabilistic models?

Question 2: We saw that MSE:

  1. cannot be used alone to evaluate the “goodness” of a model
  2. can be helpful to detect overfitting
  3. may be unhelpful in detecting underfitting

What about the log-likelihood?

Question 3: How does one evaluate a probabilistic model?

Question 4: How do we choose hyperparameters of our probabilistic model?

  1. Validation
  2. Prior or Domain Knowledge
  3. Analytically or Algorithmically Optimizing an Objective Function: \[ \sigma^{\text{MLE}} = \mathrm{argmax}_\sigma\; \ell(\mathbf{w}^{\text{MLE}}, \sigma) = \mathrm{argmax}_\sigma\; \log\left[\prod_{n=1}^N p(y_n | \mathbf{x}_n, \mathbf{w}^{\text{MLE}}, \sigma)\right] \]
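For option 3, under the linear-Gaussian model the optimization over \(\sigma\) actually has a closed-form answer: the square root of the mean squared residual at \(\mathbf{w}^{\text{MLE}}\). A minimal sketch on synthetic data (the true noise scale of 0.4 is invented for illustration):

```python
# sigma^MLE for the linear-Gaussian model: sqrt of the mean squared residual at w^MLE.
import numpy as np

rng = np.random.default_rng(2)
N, D = 500, 3
X = rng.normal(size=(N, D))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.4, size=N)

w_mle = np.linalg.solve(X.T @ X, X.T @ y)            # w^MLE (the OLS solution)
sigma_mle = np.sqrt(np.mean((y - X @ w_mle) ** 2))   # argmax over sigma of the log-likelihood
print(sigma_mle)                                     # close to the 0.4 used to generate the data
```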

4.3 The Cube: A Model Comparison Paradigm

4.3.0.1 I. Axis: Probabilistic vs Non-probabilistic

Probabilistic
- Definition: specifying a distribution for the data (and potentially the model), explicitly defining the source and type of noise
- Pro: when the model is “wrong”, we have more assumptions that we can interrogate and change in order to improve the model
- Con: we need to make more choices (e.g. assuming a distribution for the noise); more choices means more chances to be wrong

Non-Probabilistic
- Definition: specifying a functional form for the data; does not explicitly define sources and types of data randomness
- Pro: we make fewer assumptions about the data (assumptions which may be wrong); moreover, not all useful objective functions (e.g. human personal preference) can be easily given a probabilistic interpretation
- Con: when the model is “wrong”, we have fewer assumptions to which we can attribute the problem, and hence fewer obvious ways to fix the model

Probabilistic: regression by specifying likelihood

Non-probabilistic: regression by minimizing MSE, KNN, kernel regression (for now), regression trees (for now)

Question: When is it better to use probabilistic regression, when is it better to use non-probabilistic regression?

4.3.0.2 II. Axis: Parametric vs Non-Parametric

Parametric
- Definition: assumes a fixed functional form \(f(\mathbf{x})\) with a finite number of parameters for the regressor
- Pro: once the parameters of the functional form are learnt, only the parameters need to be saved
- Con: need to guess the right functional form for the regressor

Non-Parametric
- Definition: the functional form is implicit and can adapt to the observed data
- Pro: no need to specify an explicit functional form
- Con: typically, the training data needs to be saved in order to make predictions

Parametric: linear, polynomial, arbitrary basis regression (probabilistic and non-probabilistic); regression trees

Non-Parametric: KNN, kernel regression

Question: When is it better to use parametric regression, when is it better to use non-parametric regression?

4.3.0.3 III. Axis: Ease of Interpretation

Simple Models
- Example: models that can be easily inspected in their entirety, or whose decision process can be easily described
- Potential Pro: through interpretation, these models can provide scientific insight, be validated by humans, and encourage user trust
- Potential Con: simple models may be, though are not necessarily, lacking in capacity to capture complex trends in data

Complex Models
- Example: models with too many parameters to inspect individually, complex non-linear transformations of the data, or a lengthy decision process
- Potential Pro: complex models often have the capacity to capture very interesting trends in the data
- Potential Con: complex models may be, though are not necessarily, more difficult to interpret, explain, validate and trust

Model Interpretation: interpretability isn’t a binary label; every model can be “interpreted” in some way:

- linear regression: looking at regression weights
- arbitrary basis regression (including neural networks): looking at regression weights and the features \(\phi(\mathbf{x})\)
- probabilistic regression: looking at regression weights and the noise distribution
- regression tree: printing out the tree as a sequence of branching decisions
- KNN: looking at the \(k\) nearest neighbors
- kernel regression: interpreting the kernel \(k\) as a measure of similarity between two inputs \(\mathbf{x}\) and \(\mathbf{x}'\)
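As a small illustration of the first bullet, the sketch below fits a linear regression on synthetic data (the feature names are hypothetical) and reads each weight as the predicted change in \(y\) per unit change in that feature, holding the others fixed:

```python
# Interpreting a linear regression through its weights (synthetic data,
# hypothetical feature names, for illustration only).
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["sqft", "num_rooms", "age"]           # hypothetical features
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, 0.5, -1.2]) + rng.normal(0.0, 0.1, size=200)

w = np.linalg.solve(X.T @ X, X.T @ y)                  # OLS fit
for name, weight in zip(feature_names, w):
    # each weight: change in y per unit change in this feature, others held fixed
    print(f"{name}: {weight:+.3f}")
```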

Question: When do we need interpretable models, when does interpretability not matter? Who needs interpretable models? What does interpretable mean?