This is the last in a series of excerpts from Elements of Data Science, now available from Lulu.com and online booksellers.

This article is based on the Recidivism Case Study, which is about algorithmic fairness. The goal of the case study is to explain the statistical arguments presented in two articles from 2016: one from ProPublica and a response from the Washington Post.

Both are about COMPAS, a statistical tool used in the justice system to assign defendants a “risk score” that is intended to reflect the risk that they will commit another crime if released.

The ProPublica article evaluates COMPAS as a binary classifier and compares its error rates for black and white defendants. In response, the Washington Post article shows that COMPAS has the same predictive values for black and white defendants, and explains that the test cannot have the same predictive values and the same error rates for both groups at the same time.

In the first notebook I replicated the analysis from the ProPublica article. In the second notebook I replicated the analysis from the WaPo article. In this article I use the same methods to evaluate the performance of COMPAS for male and female defendants. I find that COMPAS is unfair to women: at every level of predicted risk, women are less likely to be arrested for another crime.

You can run this Jupyter notebook on Colab.

Male and female defendants

The authors of the ProPublica article published a supplementary article, How We Analyzed the COMPAS Recidivism Algorithm, which describes their analysis in more detail. In the supplementary article, they briefly mention results for male and female respondents:

The COMPAS system unevenly predicts recidivism between genders. According to Kaplan-Meier estimates, women rated high risk recidivated at a 47.5 percent rate during two years after they were scored. But men rated high risk recidivated at a much higher rate – 61.2 percent – over the same time period. This means that a high-risk woman has a much lower risk of recidivating than a high-risk man, a fact that may be overlooked by law enforcement officials interpreting the score.

We can replicate this result using the methods from the previous notebooks; we don’t have to do Kaplan-Meier estimation.
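
The code below assumes the data is already loaded into a DataFrame called cp. As a minimal sketch of how that might be done (the URL and the absence of any cleaning steps are my assumptions, not the notebooks' actual loading code), ProPublica's published dataset can be read directly:

import pandas as pd

# Hypothetical loading step: ProPublica published the two-year recidivism
# data in their compas-analysis repository. The case study notebooks build
# `cp` with their own loading and cleaning code.
url = ("https://raw.githubusercontent.com/propublica/"
       "compas-analysis/master/compas-scores-two-years.csv")
cp = pd.read_csv(url)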

According to the binary gender classification in this dataset, about 81% of defendants are male.

male = cp["sex"] == "Male"
male.mean()
0.8066260049902967
female = cp["sex"] == "Female"
female.mean()
0.19337399500970334

Here are the confusion matrices for male and female defendants.

from rcs_utils import make_matrix

matrix_male = make_matrix(cp[male])
matrix_male
          Pred Positive  Pred Negative
Actual
Positive           1732           1021
Negative            994           2072
matrix_female = make_matrix(cp[female])
matrix_female
          Pred Positive  Pred Negative
Actual
Positive            303            195
Negative            288            609
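
The make_matrix function comes from the case study's rcs_utils module. Here is a rough sketch of the kind of computation involved; the column names, the cutoff for "high risk", and the function name are assumptions for illustration, not the module's actual code.

import pandas as pd

def make_matrix_sketch(df, threshold=4):
    # Hypothetical stand-in for rcs_utils.make_matrix: treat decile scores
    # above `threshold` as predicted positive, and a new charge within two
    # years (two_year_recid == 1) as actually positive, then tabulate.
    pred = df["decile_score"] > threshold
    actual = df["two_year_recid"] == 1
    return pd.crosstab(actual, pred, rownames=["Actual"], colnames=["Predicted"])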

And here are the metrics:

from rcs_utils import compute_metrics

metrics_male = compute_metrics(matrix_male, "Male defendants")
metrics_male
                 Percent
Male defendants
FPR                 32.4
FNR                 37.1
PPV                 63.5
NPV                 67.0
Prevalence          47.3
metrics_female = compute_metrics(matrix_female, "Female defendants")
metrics_female
                   Percent
Female defendants
FPR                   32.1
FNR                   39.2
PPV                   51.3
NPV                   75.7
Prevalence            35.7
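
The compute_metrics function is also from rcs_utils, but the quantities it reports are standard. Here is a minimal sketch of the definitions (the function and argument names are mine), applied to the counts from the male confusion matrix above:

def compute_metrics_sketch(tp, fn, fp, tn):
    # Standard error rates and predictive values, expressed as percentages.
    return {
        "FPR": 100 * fp / (fp + tn),             # false positive rate
        "FNR": 100 * fn / (fn + tp),             # false negative rate
        "PPV": 100 * tp / (tp + fp),             # positive predictive value
        "NPV": 100 * tn / (tn + fn),             # negative predictive value
        "Prevalence": 100 * (tp + fn) / (tp + fn + fp + tn),
    }

# Counts from the male confusion matrix above
compute_metrics_sketch(tp=1732, fn=1021, fp=994, tn=2072)

With those counts the results match the table above: FPR 32.4, FNR 37.1, PPV 63.5, NPV 67.0, and prevalence 47.3 percent.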

The fraction of defendants charged with another crime (prevalence) is substantially higher for male defendants (47% vs 36%).

Nevertheless, the error rates for the two groups are about the same. As a result, the predictive values for the two groups are substantially different:

  • PPV: Women classified as high risk are less likely to be charged with another crime, compared to high-risk men (51% vs 64%).
  • NPV: Women classified as low risk are more likely to “survive” two years without a new charge, compared to low-risk men (76% vs 67%).

The difference in predictive values implies that COMPAS is not calibrated for men and women. Here are the calibration curves for male and female defendants.

[Figure: calibration curves showing the percent of defendants charged with another crime at each risk score, for male and female defendants]
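
A calibration curve like this groups defendants by risk score and plots the fraction of each group charged with another crime. As a minimal sketch, assuming the same hypothetical columns as in the sketches above:

import matplotlib.pyplot as plt

# Percent charged with another crime at each decile score, for each group.
for label, group in [("Male", cp[male]), ("Female", cp[female])]:
    calib = group.groupby("decile_score")["two_year_recid"].mean() * 100
    plt.plot(calib.index, calib.values, label=label)

plt.xlabel("Risk score (decile)")
plt.ylabel("Percent charged with another crime")
plt.legend()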

For all risk scores, female defendants are substantially less likely to be charged with another crime. Or, reading the graph the other way, female defendants are given risk scores 1-2 points higher than male defendants with the same actual risk of recidivism.

To the degree that COMPAS scores are used to decide which defendants are incarcerated, those decisions:

  • Are unfair to women.
  • Are less effective than they could be, if they incarcerate lower-risk women while allowing higher-risk men to go free.

What would it take?

Suppose we want to fix COMPAS so that predictive values are the same for male and female defendants. We could do that by using different thresholds for the two groups. In this section, we’ll see what it would take to re-calibrate COMPAS; then we’ll find out what effect that would have on error rates.

From the previous notebook, sweep_threshold loops through possible thresholds, makes the confusion matrix for each threshold, and computes the accuracy metrics. Here are the resulting tables for all defendants, male defendants, and female defendants.

from rcs_utils import sweep_threshold

table_all = sweep_threshold(cp)
table_male = sweep_threshold(cp[male])
table_female = sweep_threshold(cp[female])
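
For reference, here is a rough sketch of what a function like sweep_threshold might do, built on the hypothetical compute_metrics_sketch above; the actual function in rcs_utils may sweep a finer grid of thresholds.

import pandas as pd

def sweep_threshold_sketch(df):
    # For each cutoff, classify decile scores at or above it as high risk,
    # tabulate the outcomes, and compute the metrics.
    actual = df["two_year_recid"] == 1
    rows = {}
    for cutoff in range(2, 11):  # cutoff 1 would flag everyone, leaving NPV undefined
        pred = df["decile_score"] >= cutoff
        rows[cutoff] = compute_metrics_sketch(
            tp=(pred & actual).sum(), fn=(~pred & actual).sum(),
            fp=(pred & ~actual).sum(), tn=(~pred & ~actual).sum(),
        )
    return pd.DataFrame(rows).T  # one row per cutoff, one column per metric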

As we did in the previous notebook, we can find the threshold that would make predictive value the same for both groups.

from rcs_utils import predictive_value

matrix_all = make_matrix(cp)
ppv, npv = predictive_value(matrix_all)
from rcs_utils import crossing

crossing(table_male["PPV"], ppv)
array(3.36782883)
crossing(table_male["NPV"], npv)
array(3.40116329)

With a threshold near 3.4, male defendants would have the same predictive values as the general population. Now let’s do the same computation for female defendants.

crossing(table_female["PPV"], ppv)
array(6.88124668)
crossing(table_female["NPV"], npv)
array(6.82760429)
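
The crossing function interpolates to find the threshold at which a swept metric reaches a target value. A minimal sketch with NumPy, assuming the metric is monotonic in the threshold (the rcs_utils version may work differently):

import numpy as np

def crossing_sketch(series, value):
    # `series` is one metric from a sweep table, indexed by threshold.
    # Sort by the metric so np.interp sees increasing x values, then
    # interpolate the threshold where the metric equals `value`.
    order = np.argsort(series.values)
    return np.interp(value, series.values[order], series.index.to_numpy()[order])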

To get the same predictive values for men and women, we would need substantially different thresholds: about 6.8 compared to 3.4. At those levels, the false positive rates would be very different:

from rcs_utils import interpolate

interpolate(table_male["FPR"], 3.4)
array(39.12)
interpolate(table_female["FPR"], 6.8)
array(9.14)

And so would the false negative rates.

interpolate(table_male["FNR"], 3.4)
array(30.98)
interpolate(table_female["FNR"], 6.8)
array(74.18)
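
The interpolate function goes the other way: it evaluates a metric at a threshold that falls between the swept values. A sketch along the same lines:

import numpy as np

def interpolate_sketch(series, threshold):
    # Evaluate a metric column at an intermediate threshold.
    return np.interp(threshold, series.index.to_numpy(), series.values)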

If the test is calibrated in terms of predictive value, it is uncalibrated in terms of error rates.

ROC

In the previous notebook I defined the receiver operating characteristic (ROC) curve. The following figure shows ROC curves for male and female defendants:

from rcs_utils import plot_roc

plot_roc(table_male)
plot_roc(table_female)
[Figure: ROC curves for male and female defendants]
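
plot_roc is also from rcs_utils. As a minimal sketch of what such a plot involves, the ROC curve shows the true positive rate (100 minus FNR) against the false positive rate as the threshold sweeps:

import matplotlib.pyplot as plt

def plot_roc_sketch(table, **options):
    # Hypothetical stand-in for rcs_utils.plot_roc: plot the ROC curve
    # from a table of metrics swept over thresholds.
    tpr = 100 - table["FNR"]          # true positive rate, in percent
    plt.plot(table["FPR"], tpr, **options)
    plt.xlabel("False positive rate (%)")
    plt.ylabel("True positive rate (%)")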

The ROC curves are nearly identical, which implies that it is possible to calibrate COMPAS equally for male and female defendants.

Summary

With respect to sex, COMPAS is fair by the criteria posed by the ProPublica article: it has the same error rates for groups with different prevalence. But it is unfair by the criteria of the WaPo article, which argues:

A risk score of seven for black defendants should mean the same thing as a score of seven for white defendants. Imagine if that were not so, and we systematically assigned whites higher risk scores than equally risky black defendants with the goal of mitigating ProPublica’s criticism. We would consider that a violation of the fundamental tenet of equal treatment.

With respect to male and female defendants, COMPAS violates this tenet.

So who’s right? We have two competing definitions of fairness, and it is mathematically impossible to satisfy them both. Is it better to have equal error rates for all groups, as COMPAS does for men and women? Or is it better to be calibrated, which implies equal predictive values? Or, since we can’t have both, should the test be “tempered”, allowing both error rates and predictive values to depend on prevalence?

In the next notebook I explore these trade-offs in more detail. I also summarized these results in Chapter 9 of Probably Overthinking It.
