Provably Fair Machine Learning: Rawlsian Fairness for Contextual Multi-armed Bandits

In a continuation of my machine learning series, today I will discuss a more cutting-edge topic in machine learning that has significant social import: fairness.

In the modern era, there is significant momentum towards using machine learning techniques with the plethora of available data to make key decisions affecting many people's lives.

From determining which advertisements to show consumers to determining which individuals should receive loans, the scale to which machine learning plays a role in tech-driven decision making is unprecedented. As a consequence, more attention is being drawn to the impact that such decisions can have on people and society, especially when healthcare and the law come into play.

Last year, ProPublica published an article detailing discovery of racial bias in a machine learning algorithm used to predict recidivism risk (i.e. the likelihood of a criminal being a repeat offender).

Clearly, such bias is unacceptable, especially if the results of the prediction is used in determining lengths of prison sentences.

Today's post will explore a particular approach towards designing a provably fair machine-learning model.

Formulating Rawlsian Fairness

In their paper draft, Joseph et. al. offer a definition for a new framework for fair learning algorithms based on an interpretation of John Rawls's notion of "fair equality of opportunity."

“. . . assuming there is a distribution of natural assets, those who are at the same level
of talent and ability, and have the same willingness to use them, should have the same
prospects of success regardless of their initial place in the social system.”

Such a definition protects against both explicit discrimination (e.g. where choices are made as a direct consequence of someone's race) as well as disparate impact. Disparate impact describes cases where a choice discriminates against a protected group, such as a racial minority, due to observations of indirect qualities, regardless of the neutrality of the rules applied in the decision. For example, if company X requires job applicants to lift heavy objects regularly, one might observe that the employment practice has an adverse effect on the hiring of women. In order to argue against accusations of disparate impact, company X would have to show that the hiring practice is necessary and job-related.

Therein lies the essence of devising a fair algorithm in the Rawlsian setting:

Given the choice between two options – if one option is just as good as the other, the chances that you pick it should fairly reflect this!

Intuitively speaking, suppose you run a bank, and two people approach you asking for a loan. You somehow determine the probability that each person will return the loan (and the interest). Suppose that person 1 has 80% chance of returning the loan, and person 2 has 50% chance. Then, the probability that you choose to give the loan to person 1 should be at least the probability that you choose person 2 instead! (Otherwise, you'd be biased against person 1.)

This is the basis for Joseph et. al.'s definition of fairness.

Contextual Bandits

Let's take the loans example from before and frame it in a formal way.

Suppose that each day, the bank receives loan applications from $k$ individuals, and can pick exactly one person to give the loan to. At the end of the day, the bank observes whether or not the loan is returned in the form of some probability of success.

Mathematically, each individual's attributes are represented by a $d$ dimensional vector $x_i \in \mathbb{R}^d$ (i.e. $d$ defining characteristics such as years of education, income, etc. represented by real numbers). The $i$th individual's reward is denoted by $0 \leq y_i \leq 1$.

We further assume that each of the $k$ people are representative of $k$ different groups or populations, and have an associated quality function $f_i,$ $1 \leq i \leq k$ describing how likely each person is to return the loan given their attributes.

You might wonder why this assumption is valid. In fact, one could argue that it is necessary! For a particular population, length of education could be a strong predictor for a return on loan. However, for another group of people, it may not matter much at all. It would be unfair to weigh education heavily across all groups.

Almost all of the set-up is complete. We note that this setting is an online learning game. The algorithm starts off not knowing the true $f_i$ functions. Each day, the algorithm picks a person to give the loan - and observes whether it guessed right. The idea is then that the algorithm would learn from both its wins and losses! Over a long period of time - it would learn enough to guess correctly every day - finding a better approximation to the $f_i$'s.

This problem is known as contextual bandits, deriving its name from the game in which a bandit in a casino needs to pick between several slot machines to maximize his profit. The traditional problem is known as multi-armed bandits, referring to the lever arms of the slot machines. The distinction between contextual and multi-armed bandits is in the information gained after each day (time step).

In the traditional setting, the bank would learn whether each of the $k$ people would have returned the loan. In contextual bandits, the bank is not as omniscient - only observing information about the chosen applicant.

Exploration vs. Exploitation

There are a few points to address here - from an optimization and cost effective standpoint - we want the algorithm to learn as quickly as possible. Every mistake costs the bank money, since it does not get its loan back (let alone the interest!). The bank's shareholders would not be happy if it made a series of faulty loans.

A bad scenario is the following: the bank arbitrarily starts off giving loans to group 1 out of 3, and observes a high rate of return. What would incentivize the bank to take a risk and give loans to people from groups 2/3? The less information it has about the groups, the less confident the bank is in giving the loan. What results is a positive feedback loop. The bank gives loans increasingly to group 1 and only learns about this particular group, and ignores the rest. Clearly, not only is this not fair - it could also result in poor performance ( i.e. group 3 could actually have a higher rate of return, but the bank would never learn this).

This problem is called one of exploration vs. exploitation. Since we only learn about the applicants chosen, the bank needs to explore - take occasional risks and give loans to those it is not as certain about (due to a lack of information). To avoid too many losses, it should also exploit its existing knowledge to maximize its chances of getting the loan back. As you can imagine, finding the right balance is tricky. There is a tradeoff between efficiency of learning and maximization of return.

Encoding Fairness

Additionally, we want the algorithm to be fair, in the same Rawlsian notion we defined above. We don't just want the algorithm to be asymptotically fair (in the grand scheme of things), but fair at each step.

This restriction can be represented mathematically as follows:

At each time step $t$, for any two individuals $x_{it}$, $x_{jt}$, if $f(x_{it}) \geq f(x_{jt})$, then the probability of choosing individual $i$ is at least that of choosing $j$.

Interval Chaining

We further apply the assumption that each of the $f_i$ quality functions takes the form of a linear function:

$$f_i(x_{i}) = \beta_i \cdot x_{i} = \beta_{i,1} x_{i, 1} + \beta_{i,2} x_{i, 2} + \dots + \beta_{i,d} x_{i, d}$$

A reasonable approach for estimating the $f_i$ functions is to find estimators for the $\beta$ parameters using linear regression.


Before showing the actual algorithm, here is the crux of its logic: for each round, do the following.

For each arm (applicant), compute an estimate for its quality with a corresponding confidence interval. Let $i_{*}^t$ indicate the arm with the largest upper bound for its quality. Then, determine which arms have confidence intervals that when overlapping, form a contiguous chain with $i_{*}^t$'s interval.

Intuitively speaking, we don't really know how good our estimates are! What if we erroneously gave arm $1$ too high of an estimate, whereas arm $2$'s was spot on? Hence, we use confidence intervals to hedge against this mistake. As you will later see, we can with some assumptions exactly compute these confidence intervals to guarantee that the true value falls into the confidence interval with only a small probability of failure.

In the above graphic, $i_{*}^t = 1$. Then, by the algorithm, we would compute the chain of intervals connected to $1$, which would result in $[1, 2, 3, 4]$. Then, play one of these arms uniformly at random.

As it turns out, this algorithm proves to be fair under the Rawlsian definition!

Following is the proposed algorithm (in Pythonic pseudocode) for solving linear contextual bandits:

import numpy as np

def interval_chaining(delta_, T):
    # 1. Pull each arm once to start
    for t in range(T):
        if np.random.rand() <= 1/pow(t, 1/3):
            # 2. Play an exploration round by
            #    picking an arm at random
            # 2. Compute confidence intervals 
            #    using linear regression (OLS)
            intervals = []
            for i in range(k):
                # 3a. Compute OLS estimator
                # 3b. Compute confidence interval for estimator
                # 3c. Store interval in `intervals`
            # 4. Find interval with largest upper bound
            i_st = np.argmax(np.array(intervals)[:,1])
            # 5. Find all intervals that contiguously overlap
            #    with said interval
            chain = compute_chain(i_st, np.array(intervals))
            Y[i][t] = play(np.random.choice(chain))

For a complete implementation of the algorithm, you can look at a version published in this repository. The details of how to compute the confidence interval mathematically will be shown in the proof below.

From here on out, things will get pretty math-heavy and technical. It helps to have exposure to probability, elementary statistics, and linear algebra. However, I try to explain in words alongside all of the math - so don't worry too much if the notation is hard to follow!

To demonstrate how this algorithm is provably fair, we'll need to make use of some useful properties of OLS.

Ordinary Least Squares (OLS)

Before we get into the further details of proving the fairness algorithm, let's run through a quick primer through OLS and some of its properties. Feel free to skip through this section if you are already comfortable!

Suppose we have data for a set of regressor variables denoted $x_{i1}, \dots x_{id}$ and another variable, $y_i$. (The $i$ indicates that this set represents the $i$th sample in a dataset, as opposed to before where it denoted the arm number.) The regressors are our features (e.g. education, years of work experience), and the $y_i$ is what we are trying to predict. In OLS, we assume that there is a linear relationship between the regressors and the $y_i$.

Geometrically speaking, given a set of points, there is a line (in 2D, or a hyperplane in > 2D) that 'perfectly' fits through all of the points and can be used as a model to predict on future data. That is, we assume the following:

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_d x_{id} + \epsilon_i = \sum_{j=1}^d \beta_j x_{ij} + \epsilon_i$$

for $i = 1, \dots, n$ where $n$ is the number of observations.

While the form is slightly different, this is a more general case of your standard $y = mx + b$ best fit.

$\epsilon_i$ is termed a residual, and serves to capture errors in measurement of the dependent variable and/or the effects of omitted variables that might also affect the dependent variable. These are assumed to be normally distributed, such that $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. (If you are not familiar with the normal distribution, think of the bell curve!) The distribution is thus centered at $0$ and has some standard deviation $\sigma$.

The crux of OLS is that we want to find the values for $\beta$ that minimize the sum of squared errors:

\epsilon_i &= y_i - \sum_{j=1}^d \beta_j x_{ij} = y_i - \beta x_i\\
\sum_{i=1}^n \epsilon_i^2 &= \sum_{i=1}^n (y_i - \beta x_i)^2\\
\hat{\beta} &= \text{argmin}_{\beta} \sum_{i=1}^n (y_i - \beta x_i)^2

Stacking all of the $x_{i}$ vectors row-wise and the $y_i, \epsilon_i$ values into vectors, we can write the linear model in matrix form as:

$$Y = X\beta + \epsilon$$

Skipping some tedious matrix calculus, we can show that the OLS estimator for $\beta$, denoted as $\hat{\beta}$ is

$$\hat{\beta} = (X'X)^{-1} X'Y$$

Asymptotic Distribution of OLS Estimator

We assume in this example that the regressors are nonstochastic, i.e. they are not random. Then, $E[x_i] = x_i$.

The point. We show that the OLS estimator $\hat{\beta}$ is normally distributed, centered around the true parameter $\beta$. This will be useful, as we'll immediately be able to show that the estimate for $y_{t,i}$ we get is also normally distributed! This lets us easily compute confidence intervals.

We first note the assumption that the residuals $\epsilon_i$ are presumed to be normally distributed with mean $0$ and variance $\sigma^2$.

We then observe the general property where for a random vector $v \sim \mathcal{N}(b, \Sigma)$, given a matrix $A$,

$$Av \sim \mathcal{N}(Ab, A\Sigma A')$$

Thus, we show the following.

\hat{\beta} &= (X'X)^{-1} X'Y\\
&= (X'X)^{-1}X'(X\beta + \epsilon)\\
&= \beta + (X'X)^{-1}X'\epsilon\\
\hat{\beta} - \beta &= (X'X)^{-1}X'\epsilon\\
&\sim \mathcal{N}(0, (X'X)^{-1} X' \sigma^2 I X(X'X)^{-1})\\
&\sim \mathcal{N}(0, \sigma^2 (X'X)^{-1} X' I X(X'X)^{-1})\\
&\sim \mathcal{N}(0, \sigma^2(X'X)^{-1})\\
&\implies \hat{\beta} \sim \mathcal{N}(\beta, \sigma^2(X'X)^{-1})

Proof of Fairness

We now leverage this useful property to prove that the interval chaining algorithm above is fair under our desired constraints, $$\hat{\beta}_{t,i} \sim \mathcal{N}(\beta_{t,i}, \sigma^2(X^T_{t,i} X_{t,i})^{-1})$$

Recall that in this context, $i$ denotes the arm number and $t$ denotes the sample index, $t \in {1, \dots, T}$, $i \in {1, \dots, k}$. The values we are working with are vectors and matrices, so the feature index is hidden.

Then, for any fixed $x_{t,i}$, we have (using the same property from the above proof) $$\hat{\beta}_{t,i} \cdot x_{t,i} \sim \mathcal{N}(\beta_{t,i} \cdot x_{t,i}, x_{t,i}^T \sigma^2(X^T_{t,i} X_{t,i})^{-1} x_{t,i})$$

In the above algorithm, let the confidence interval be computed as follows:
$$[\ell_i^t, u_i^t] = [\hat{y}_{t,i} - w_{t,i}, \hat{y}_{t,i} + w_{t,i}] $$

where $w_{t,i} = \Phi^{-1}(\frac{\delta}{2kT})$ and $\Phi^{-1}(\cdot)$ is the inverse CDF of the normal distribution, i.e. the value at which the probability of the applied estimator is less than or equal to the given probability bound.

The point. Intuitively, we are choosing a probability of failure - and find the necessary width for the interval that will guarantee that the true quality will never be outside of the interval with probability greater than what we picked.

Then, the true value $\beta_{t,i} \cdot x_{t,i} \not\in [\ell_i^t, u_i^t]$ with probability at most $\frac{\delta}{kT}$. The probability then that this fails to hold for any $i$ at any time $t$ is at most $kT \cdot \frac{\delta}{kT} = \delta$ by the union bound. Conditioning on this probability for the rest of the argument, we then show the fairness condition holds:

For all arms $i, j$ and all rounds $t$, the probability that the algorithm places on arm $i$ in round $t$ is at least as great as the probability the algorithm places on arm $j$ in the same round.

We're almost there. Consider any fixed pair of arms $i, j$ and a fixed round $t$. If $\beta_{t,i}\cdot x_{t,i} \geq \beta_{t,j} \cdot x_{t,j}$, then clearly $u_i^t \geq \beta_{t,i} \cdot x_{t,i} $ and $\ell_j^t \leq \beta_{t,j} \cdot x_{t,j}$, and

$$u_i^t \geq \beta_{t,i} \cdot x_{t,i} \geq \beta_{t,j} \cdot x_{t,j} \geq \ell_j^t$$

We have two cases. Either $\ell_i^t \geq u_j^t$, in which case there is no overlap between the two intervals. In the other case, $\ell_i^t \leq u_j^t$, and then if $j$ is chained to the arm with the largest upper bound $i_{*}^t$, then $i$ will as well.

Thus, either both arms are played with equal probability (1/the size of the chain), or only $i$ will be played with that probability and $j$ with probability $0$, or both are played with probability $0$ (neither are in the chain).

Clearly, the fairness condition holds! Since we considered any arbitrary pair of arms, this holds in generality.

Performance of Interval Chaining

I won't get into the technicalities here, but it can be shown that in practice, the time it takes for the above algorithm to converge to the correct answer (presuming that the truth is in fact a linear model) is not much worse than if the fairness constraint were to be ignored. Joseph et. al. have a detailed proof of this in their paper in the Appendices.

Closing Remarks

What was discussed in this post was a novel framework for evaluating the fairness of learning algorithms that are round based (online). There are great ethical concerns involved with using machine learning to make key decisions with significant social and financial impact. It is critical that data scientists and engineers work to make their learning algorithms as transparent, fair, and yet still useful as possible. If you are interested in following the academic conversation, check out FAT/ML, an annual workshop held to discuss fairness, accountability, and transparency in ML.

Also – observe that the techniques described in this post do not rectify for underlying bias in the data. While the assumption was made that data comes from separably distinct sources - this assumption may not always hold! It is significantly more difficult to profile existing bias in data - and the very act of trying to correct for it actually introduces bias in itself. There has been some work showing that the very nature of human language is coupled with ethnic and gender bias.

The approach taken by Joseph et. al. is not as ambitious as to correct for all bias - but to simply make internal choices that are as fair as the algorithm can discern.

If you are seeking to use these techniques in actually building out a bank that serves loans, you might be in for some significant challenges. As noted by a reader, in practice not all loans operate on the same time-frame, not to mention that the round-based assumption may not hold. Additionally, even if applications were to be received in groups of size $k$, there is no saying that the bank wouldn't want to accept more than $1$! There are clear challenges in taking this particular algorithm to the real world, but it is useful in illustrating a viable solution to a fair learning problem.

If you've read this far, thank you so much! This work is very interesting to me, and I hope I was able to teach you something new today with this. As usual, if there any typos or mistakes - or if you have any questions, feel free to leave a comment below.


Many thanks to the artists whose graphics I have used in this post.