Chapter 7 How Do We Responsibly Use Conditional Models?
7.1 Everything We’ve Done So Far in Probabilistic ML
For a set of observations \(y\), we can collect additional data \(\mathbf{x}\) and model the conditional distribution \(p(y|\mathbf{x})\). This is exactly what regression and classification models do: they use covariates \(\mathbf{x}\) to predict the outcome \(y\).
(Conditional Models) Why do we need additional data \(\mathbf{x}\)? Why isn’t it sufficient to just model \(y\) by itself (as a binomial, Gaussian, or a mixture of Gaussians)?
- to predict
- to explain
- to model finer-grained variations in \(y\) (i.e. to model data heterogeneity)
(Classification) For binary data (like whether or not an individual has kidney cancer), we can model \(p(y|\mathbf{x})\) as a Bernoulli \[ y \sim \mathrm{Ber}\left(\mathrm{sigm}\left(f_\mathbf{w}\left(\mathbf{x} \right)\right)\right). \]
(Maximizing the Likelihood) We want to maximize the log Bernoulli likelihood of \(y\) (conditioned on \(\mathbf{x}\)) over model parameters \(\mathbf{w}\), or equivalently minimize the negative log-likelihood \(\ell(\mathbf{w})\). But the zeros of its gradient cannot be solved for analytically! \[ \nabla_{\mathbf{w}} \ell(\mathbf{w}) = -\sum_{n=1}^N \left(y_n - \frac{1}{1 + e^{-\mathbf{w}^\top\phi(\mathbf{x}_{n})}} \right) \phi(\mathbf{x}_{n}) =\mathbf{0} \] But just because we can’t analytically find the zeros of the gradient, it doesn’t mean that the gradient is useless! The gradient of a loss function points in the direction of the greatest rate of increase, hence the negative gradient points in the direction of steepest decrease of the loss function.
When we iteratively decrease the loss function by changing \(\mathbf{w}\) in the direction of the negative gradient, this is called gradient descent.
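To make this concrete, here is a minimal NumPy sketch of gradient descent on the negative Bernoulli log-likelihood above. The feature map \(\phi(x) = [1, x]\), the learning rate, and the toy data are all illustrative assumptions, not part of the course material.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_neg_log_lik(w, Phi, y):
    # The gradient from above: -sum_n (y_n - sigm(w^T phi(x_n))) phi(x_n)
    return -Phi.T @ (y - sigmoid(Phi @ w))

def gradient_descent(Phi, y, lr=0.5, n_iters=2000):
    # Repeatedly step in the direction of the negative gradient of the loss.
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        w -= lr * grad_neg_log_lik(w, Phi, y) / len(y)  # averaged for a stable step size
    return w

# Toy data (an assumption for illustration): phi(x) = [1, x], i.e. a bias plus one covariate.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
Phi = np.column_stack([np.ones_like(x), x])
y = rng.binomial(1, sigmoid(-1.0 + 2.0 * x))

print(gradient_descent(Phi, y))  # roughly recovers the true weights [-1, 2]
```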
(Properties of Gradient Descent) Gradient descent sounds like we can optimize any function without work, is it as good as it sounds?
- What do I need to know in order to use gradient descent?
- Is gradient descent guaranteed to converge?
- Is gradient descent efficient?
(Model Evaluation) How do I know my model is “correct”?
- Choose a meaningful numeric predictive metric (e.g. MSE, log-likelihood, accuracy, AUC etc) – which metric measures the model behavior that I actually care about?
- Evaluate your model along other important real-life axes:
- Probabilistic vs Non-probabilistic
- Parametric vs Non-parametric
- Interpretability
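For instance, a quick sketch (assuming scikit-learn is available; the held-out labels and predicted probabilities below are made up for illustration) of computing a few of these metrics side by side:

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

# Made-up held-out labels and predicted probabilities p(y=1 | x).
y_test = np.array([0, 0, 1, 1, 1, 0, 1, 0])
p_hat = np.array([0.10, 0.40, 0.80, 0.65, 0.90, 0.30, 0.55, 0.20])

# Each metric measures a different notion of "correct":
print("accuracy:", accuracy_score(y_test, p_hat > 0.5))  # thresholded decisions
print("AUC:     ", roc_auc_score(y_test, p_hat))         # quality of the ranking
print("log-lik: ", -log_loss(y_test, p_hat))             # mean log-likelihood (calibration-sensitive)
```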
(The Challenges of Conditional Models) Conditional models have so much potential to generate scientific understanding of our data but they are also the most frequently misused!
Suppose that you fit a logistic regression model to predict whether a loan application should be approved. Suppose that you have three covariates:
- \(x_1\) representing gender: 0 for male, 1 for female
- \(x_2\) for the income
- \(x_3\) for the loan amount

Suppose that the parameters you found are: \[ p(y=1 | x_1, x_2, x_3) = \mathrm{sigm}(-1 + 3 x_1 + 1.5 x_2 + 1.75 x_3). \] What can I conclude about the loan decisions data that this model was fitted on? Can I deploy this model for usage?
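To make the question concrete, here is a small sketch that evaluates the fitted model above for two hypothetical applicants who differ only in the gender covariate \(x_1\); the income and loan-amount values are made up and assumed to be on some standardized scale.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_approve(x1, x2, x3):
    # The fitted model from the example:
    # p(y=1 | x1, x2, x3) = sigm(-1 + 3*x1 + 1.5*x2 + 1.75*x3)
    return sigmoid(-1.0 + 3.0 * x1 + 1.5 * x2 + 1.75 * x3)

# Identical income (x2) and loan amount (x3); only the gender covariate differs.
print("x1 = 0 (male):  ", p_approve(0, 0.5, 0.5))  # ~0.65
print("x1 = 1 (female):", p_approve(1, 0.5, 0.5))  # ~0.97
```

Flipping the gender covariate alone changes the predicted approval probability dramatically, which is exactly why the questions about interpretation and deployment that follow matter.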
- Bias in your conclusions is exponentially exacerbated by bias in the data collection process.
What information are you asking for? If you ask the wrong question, the covariates you collect are at best unpredictive, and at worst they can mask the signal in the data.
In many data collection processes, one finds that the questionnaire might be using “sex” and “gender” interchangeably, or “race” and “ethnicity” interchangeably. But the former (depending on field and context) frequently refers to biological invariants (like chromosomes and genetic lineage), whereas the latter frequently refers to lived experiences (related to identity, culture, etc.). Caveat: although we collect data on race and sex as if delineation into these categories were more “immutable” or “factual”, many have argued (and continue to argue) that these concepts are also social constructs (see the discussion on false binaries and false aggregation)!
When you ask for “sex”, are you asking about a person’s chromosomal information, because you’re studying chromosome-related diseases, or are you asking about their experiences with the US medical system and how their health outcome is affected by their gender identity OR the identity that people tend to assume of them? For example: women, especially women of color, tend to be under-diagnosed and mis-diagnosed due to gendered treatment of patients.
To complicate matters, the terms “gender” and “sex” can have very specific legal definitions – which may not align with definitions in medical or sociological contexts! For example, the language of the Equal Credit Opportunity Act is written in terms of “sex”; however, this term is not defined in the ECOA. The Consumer Financial Protection Bureau interprets (in their operating philosophy) this term to include both sexual orientation and gender identity.
Did you already bias the answer? If you gave confusing, misleading or incorrect options for answers (think about when you were frustrated with the concept quizzes!), then the covariates you collect are at best unpredictive, and at worst they can mask the signal in the data.
In many data collection processes, one finds that the respondent is forced into a category that is unrepresentative of their experience, like having “Asian” as a broad racial category that lumps together South Asian, East Asian, and South East Asian respondents, or like providing only binary choices for gender or sex.
If you are looking for disparities in health outcomes, you might miss disparities amongst important smaller subgroups within the broad group of “Asian”, since health outcomes are often affected by income and Asian-Americans have the largest in-group income disparity.
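As a toy illustration of why disaggregation matters (all numbers below are invented), the aggregate rate for a broad category can look unremarkable while the subgroup rates differ sharply:

```python
import pandas as pd

# Invented toy data: one broad racial category with three subgroups.
df = pd.DataFrame({
    "subgroup": ["East Asian"] * 2 + ["South Asian"] * 2 + ["South East Asian"] * 2,
    "outcome":  [1, 1, 1, 0, 0, 0],
})

print("aggregate rate:", df["outcome"].mean())    # one number for the broad group
print(df.groupby("subgroup")["outcome"].mean())   # the within-group variation it hides
```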
Should you be asking for this information? The complexity of asking a seemingly simple question can seem overwhelming; often we resort to over-collecting (collecting every piece of information that we can think of) or under-collecting (not collecting any sensitive attributes at all).
Why you might not want to: Respondents (e.g. patients, applicants for benefits) can be suspicious that their data will be used precisely to discriminate against them on the basis of protected attributes. Note that data collection has often historically been used to oppress, not to uplift, vulnerable communities. So collecting this data without a clear purpose can compromise user trust and constitute a privacy threat if the data is not well protected.
Why you should: However, in order to make sure our model is not disproportionately negatively impacting vulnerable subgroups and that we are in compliance with laws like the ECOA, we need to check for disparate impact on protected groups. But in order to do that, we need to have access to the sensitive attributes.
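For example, here is a minimal sketch of such a check; the data, column names, and the four-fifths threshold are illustrative assumptions, not a definitive legal test.

```python
import pandas as pd

# Hypothetical model decisions joined with the protected attribute, which has to
# be collected (and carefully protected) for this audit to be possible at all.
decisions = pd.DataFrame({
    "gender":   ["male", "male", "male", "female", "female", "female"],
    "approved": [1, 1, 1, 1, 0, 0],
})

# Approval rate per protected group.
rates = decisions.groupby("gender")["approved"].mean()
print(rates)

# A common screening heuristic ("four-fifths rule"): flag for review if the
# lowest group approval rate falls below 80% of the highest group's rate.
impact_ratio = rates.min() / rates.max()
print("disparate impact ratio:", impact_ratio, "-> review" if impact_ratio < 0.8 else "-> ok")
```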
- Is correlation causation?
It appears that the gender covariate \(x_1\) has the greatest influence on the prediction of the logistic regression model; does this mean that the human decision makers were biased? (Hint: not necessarily, but it is a hypothesis worth exploring to retroactively detect potential regulatory violations in this dataset.)
Even if the interpretation of the coefficients of the logistic regression model does not imply human bias in the data, can we use this model for real-life loan decisions? (Hint: the answer is NO. Basing loan decisions on a protected attribute like gender would be a violation of the ECOA!)