Density and Likelihood: What’s the Difference?

July 13, 2024 AllenDowney

It’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

If you get this post by email, the formatting might be broken — if so, you might want to read it on the site.


Density and Likelihood¶

Here’s a question from the Reddit statistics forum.

I’m a math graduate and am partially self taught. I am really frustrated with likelihood and probability density, two concepts that I personally think are explained so disastrously that I’ve been struggling with them for an embarrassingly long time. Here’s my current understanding and what I want to understand:

probability density is the ‘concentration of probability’ or probability per unit and the value of the density in any particular interval depends on the density function used. When you integrate the density curve over all outcomes x in X where X is a random variable and x are its realizations then the result should be all the probability or 1.

likelihood is the joint probability, in the discrete case, of observing fixed and known data depending on what parameter(s) value we choose. In the continuous case we do not have a nonzero probability of any single value but we do have nonzero probability within some infinitely small interval (containing infinite values?) [x, x+h] and maximizing the likelihood of observing this data is equivalent to maximizing the probability of observing it, which we can do by maximizing the density at x.

My questions are:

Is what I wrote above correct? Probability density and likelihood are not the same thing. But what the precise distinction is in the continuous case is not completely cut and dry to me. […]

I agree with OP — these topics are confusing and not always explained well. So let’s see what we can do.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

Out[1]:

'utils.py'

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

In [3]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

Mass¶

I’ll start with a discrete distribution, so we can leave density out of it for now and focus on the difference between a probability mass function (PMF) and a likelihood function.

As an example, suppose we know that a hockey team scores goals at a rate of 3 goals per game on average. If we model goal scoring as a Poisson process — which is not a bad model — the number of goals per game follows a Poisson distribution with parameter mu=3.

The PMF of the Poisson distribution tells us the probability of scoring k goals in a game, for non-negative values of k.

In [4]:

from scipy.stats import poisson

ks = np.arange(12)
ps = poisson.pmf(ks, mu=3)

Here’s what it looks like.

In [5]:

plt.bar(ks, ps)
decorate(xlabel='Goals (k)', ylabel='Probability mass')


The PMF is a function of k, with mu as a fixed parameter.

The values in the distribution are probability masses, so if we want to know the probability of scoring exactly 5 goals, we can look it up like this.

In [6]:

ps[5]

Out[6]:

0.10081881344492458

It’s about 10%.

The sum of the ps is close to 1.

In [7]:

ps.sum()

Out[7]:

0.999928613371026

If we extend the ks to infinity, the sum is exactly 1. So this set of probability masses is a proper discrete distribution.
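To make that claim concrete, here is a minimal sketch that computes the Poisson PMF directly from its formula, mu**k * exp(-mu) / k!, using only the standard library. The helper name poisson_pmf is mine, not part of SciPy; it should agree with poisson.pmf up to floating-point error.

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """Probability of exactly k events when the average rate is mu."""
    return mu**k * exp(-mu) / factorial(k)

mu = 3

# Sum the first 12 masses, as in the cell above, and then the first 50.
total_12 = sum(poisson_pmf(k, mu) for k in range(12))
total_50 = sum(poisson_pmf(k, mu) for k in range(50))

print(total_12)  # a little less than 1
print(total_50)  # much closer to 1
```

The more terms we include, the closer the total gets to 1, because the terms we leave out are the small masses in the right tail.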

Likelihood¶

Now suppose we don’t know the goal scoring rate, but we observe 4 goals in one game. There are several ways we can use this data to estimate mu. One is to find the maximum likelihood estimator (MLE), which is the value of mu that makes the observed data most likely.

To find the MLE, we need to maximize the likelihood function, which is a function of mu with a fixed number of observed goals, k. To evaluate the likelihood function, we can use the PMF of the Poisson distribution again, this time with a single value of k and a range of values for mu.

In [8]:

k = 4
mus = np.linspace(0, 20, 201)
ls = poisson.pmf(k, mus)

Here’s what the likelihood function looks like.

In [9]:

plt.plot(mus, ls)
decorate(xlabel='mu', ylabel='Likelihood')

To find the value of mu that maximizes the likelihood of the data, we can use argmax to find the index of the highest value in ls, and then look up the corresponding element of mus.

In [10]:

i = np.argmax(ls)
mus[i]

Out[10]:

4.0

In this case, the maximum likelihood estimator is equal to the number of goals we observed.
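Here is one way to see why the grid search lands on mu = k. The log of the Poisson likelihood is k * log(mu) - mu - log(k!), and its derivative with respect to mu is k / mu - 1, which is zero when mu = k. The sketch below checks that numerically; the function names are mine, chosen for illustration.

```python
from math import exp, factorial, log

def log_likelihood(mu, k):
    """Log of the Poisson likelihood of mu, given k observed goals."""
    return k * log(mu) - mu - log(factorial(k))

def slope(mu, k):
    """Derivative of the log-likelihood with respect to mu."""
    return k / mu - 1

k = 4

# The slope is positive below mu = k, zero at mu = k, and negative above it,
# so the likelihood rises to a peak at mu = k and then falls.
for mu in [3.0, 4.0, 5.0]:
    print(mu, slope(mu, k), exp(log_likelihood(mu, k)))
```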

That’s the answer to the estimation problem, but now let’s look more closely at those likelihoods. Here’s the likelihood at the maximum of the likelihood function.

In [11]:

np.max(ls)

Out[11]:

0.19536681481316454

This likelihood is a probability mass — specifically, it is the probability of scoring 4 goals, given that the goal-scoring rate is exactly 4.0.

In [12]:

poisson.pmf(4, mu=4)

Out[12]:

0.19536681481316454

So, some likelihoods are probability masses — but not all.

Density¶

Now suppose, again, that we know the goal scoring rate is exactly 3, but now we want to know how long it will be until the next goal. If we model goal scoring as a Poisson process, the time until the next goal follows an exponential distribution with a rate parameter, lam=3.

Because the exponential distribution is continuous, it has a probability density function (PDF) rather than a probability mass function (PMF). We can approximate the distribution by evaluating the exponential PDF at a set of equally-spaced times, ts.

SciPy’s implementation of the exponential distribution does not take lam as a parameter, so we have to set scale=1/lam.

In [13]:

from scipy.stats import expon

lam = 3.0
ts = np.linspace(0, 2.5, 201)
ps = expon.pdf(ts, scale=1/lam)

The PDF is a function of t with lam as a fixed parameter. Here’s what it looks like.

In [14]:

plt.plot(ts, ps)
decorate(xlabel='Games until next goal', ylabel='Density')

Notice that the values on the y-axis extend above 1. That would not be possible if they were probability masses, but it is possible because they are probability densities.

By themselves, probability densities are hard to interpret. As an example, we can pick an arbitrary element from ts and the corresponding element from ps.

In [15]:

ts[40], ps[40]

Out[15]:

(0.5, 0.6693904804452895)

So the probability density at t=0.5 is about 0.67. What does that mean? Not much.

To get something meaningful, we have to compute an area under the PDF. For example, if we want to know the probability that the first goal is scored during the first half of a game, we can compute the area under the curve from t=0 to t=0.5.

We can use a slice index to select the elements of ps and ts in this interval, and NumPy’s trapz function, which uses the trapezoid method to compute the area under the curve.

In [16]:

np.trapz(ps[:41], ts[:41])

Out[16]:

0.7769608771522626

The probability of a goal in the first half of the game is about 78%. To check that we got the right answer, we can compute the same probability using the exponential CDF.

In [17]:

expon.cdf(0.5, scale=1/lam)

Out[17]:

0.7768698398515702

Considering that we used a discrete approximation of the PDF, our estimate is pretty close.

This example provides an operational definition of a probability density: it’s something you can add up over an interval — or integrate — to get a probability mass.
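This also speaks to OP’s picture of a small interval [x, x+h]. As a rough sketch, using only the standard library, we can compare the exact probability of landing in a short interval with the density times the width of the interval. The width h = 0.01 is an arbitrary choice for illustration, and the helper functions are my own.

```python
from math import exp

lam = 3.0   # goals per game, as in the example above
t = 0.5     # start of the interval
h = 0.01    # width of the interval (arbitrary, for illustration)

def expo_pdf(t, lam):
    """Density of the exponential distribution at t."""
    return lam * exp(-lam * t)

def expo_cdf(t, lam):
    """Probability that the first goal comes at or before t."""
    return 1 - exp(-lam * t)

# Exact probability of the interval, and the density-times-width shortcut.
exact = expo_cdf(t + h, lam) - expo_cdf(t, lam)
approx = expo_pdf(t, lam) * h

print(exact, approx)
```

The two values should agree to within a couple of percent, and the agreement improves as h gets smaller. That is the sense in which a density is probability per unit.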

Likelihood and Density¶

Now let’s suppose that we don’t know the parameter lam and we want to use data to estimate it. And suppose we observe a game where the first goal is scored at t=0.5. As we did when we estimated the parameter mu of the Poisson distribution, we can find the value of lam that maximizes the likelihood of this data.

First we’ll define a range of possible values of lam.

In [18]:

lams = np.linspace(0, 20, 201)

Then for each value of lam, we can evaluate the exponential PDF at the observed time t=0.5 — using errstate to ignore the “division by zero” warning when lam is 0.

In [19]:

t = 0.5

with np.errstate(divide='ignore'):
    ls = expon.pdf(t, scale=1/lams)

The result is a likelihood function, which is a function of lam with a fixed value of t. Here’s what it looks like.

In [20]:

plt.plot(lams, ls)
decorate(xlabel='Scoring rate (lam)', ylabel='Likelihood')

Again, we can use argmax to find the index where the value of ls is maximized, and look up the corresponding value of lams.

In [21]:

i = np.argmax(ls)
lams[i]

Out[21]:

2.0

If the first goal is scored at t=0.5, the value of lam that maximizes the likelihood of this observation is 2.

Now let’s look more closely at those likelihoods. If we select the corresponding element of ls, the result is a probability density.

In [22]:

ls[i]

Out[22]:

0.7357588823428847

Specifically, it’s the value of the exponential PDF with parameter 2, evaluated at t=0.5.

In [23]:

expon.pdf(0.5, scale=1/2)

Out[23]:

0.7357588823428847

So, some likelihoods are probability densities — but as we’ve already seen, not all.

Discussion¶

What have we learned?

In the first example, we evaluated a Poisson PMF at discrete values of k with a fixed parameter, mu. The results were probability masses.
In the second example, we evaluated the same PMF for possible values of a parameter, mu, with a fixed value of k. The result was a likelihood function where each point is a probability mass.
In the third example, we evaluated an exponential PDF at possible values of t with a fixed parameter, lam. The results were probability densities, which we integrated over an interval to get a probability mass.
In the fourth example, we evaluated the same PDF at possible values of a parameter, lam, with a fixed value of t. The result was a likelihood function where each point is a probability density.

A PMF or PDF is a function of an outcome, like the number of goals scored or the time until the first goal, given a fixed parameter. If you evaluate a PMF, you get a probability mass. If you evaluate a PDF, you get a probability density. If you integrate density over an interval, you get a probability mass.

A likelihood function is a function of a parameter, given a fixed outcome. If you evaluate a likelihood function, you might get a probability mass or a density, depending on whether the outcome is discrete or continuous. Either way, evaluating a likelihood function at a single point doesn’t mean much by itself. A common use of a likelihood function is finding a maximum likelihood estimator.
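One way to check that a likelihood function is not a distribution over the parameter is to compute the area under it. This is a sketch under the same setup as the fourth example (t = 0.5, values of lam from 0 to 20), written with a hand-rolled trapezoid rule so it doesn’t depend on any library. If the likelihood were a PDF of lam, the area would be 1.

```python
from math import exp

t = 0.5  # observed time of the first goal

def likelihood(lam, t):
    """Exponential density at t, viewed as a function of lam."""
    return lam * exp(-lam * t)

# Same grid as lams in the example: 201 points from 0 to 20.
lams = [20 * i / 200 for i in range(201)]
ls = [likelihood(lam, t) for lam in lams]

# Trapezoid rule: average adjacent heights, multiply by the spacing.
dx = lams[1] - lams[0]
area = sum((ls[i] + ls[i + 1]) / 2 * dx for i in range(200))

print(area)
```

The area comes out close to 4, not 1, so the values in ls can’t be read as densities of lam, even though each one is a density of t.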

As OP says, “Probability density and likelihood are not the same thing”, but the distinction is not clear because they are not completely distinct things, either.

A probability density can be a likelihood, but not all densities are likelihoods.
A likelihood can be a probability density, but not all likelihoods are densities.

I hope that helps!

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

PMFs and PDFs

July 6, 2024 AllenDowney



PMFs and PDFs¶

Here’s a question from the Reddit statistics forum.

I met a basic problem in pdf and pmf. I perform a grid approximation on bayesian problem. After normalizing the vector, I got a discretized pmf. Then I want to draw pmf and pdf on a plot to check if they are similar distribution. However, the pmf doesn’t resemble the pdf at all. The instruction told me that I need to sample from my pmf then draw a histogram with density for comparison. It really works, but why can’t I directly compare them?

In Think Bayes I used this kind of discrete approximation in almost every chapter, so this issue came up a lot! There are at least two good solutions:

Normalize the PDF and PMF so they are on the same scale, or
Use CDFs.

I’ll demonstrate both.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Baseball¶

As an example, I’ll solve an exercise from Chapter 4 of Think Bayes:

In Major League Baseball, most players have a batting average between .230 and .300, which means that their probability of getting a hit is between 0.23 and 0.30.

Suppose a player appearing in their first game gets 3 hits out of 3 attempts. What is the posterior distribution for their probability of getting a hit?

To represent the prior distribution, I’ll use a beta distribution with parameters I chose to be consistent with the expected range of batting averages.

In [3]:

from scipy.stats import beta

a, b = 40, 110
beta_prior = beta(a, b)
beta_prior.mean(), beta_prior.ppf([0.15, 0.85])

Out[3]:

(0.26666666666666666, array([0.22937673, 0.30411982]))

Here’s what the PDF of this distribution looks like.

In [4]:

qs = np.linspace(0, 1, 201)
ps = beta_prior.pdf(qs)

In [5]:

plt.plot(qs, ps, label='prior')
decorate(xlabel='Batting average', ylabel='PDF')

Unlike probability masses, probability densities can exceed 1. But the area under the curve should be 1, which we can check using trapz, which is a NumPy function that computes numerical integrals using the trapezoid rule.

In [6]:

area = np.trapz(ps, qs)
area

Out[6]:

1.0

Within floating-point error, the area under the PDF is 1. To do the Bayesian update, I’ll put these values in a Pmf object and normalize it so the sum of the probability masses is 1.

In [7]:

from empiricaldist import Pmf

pmf_prior = Pmf(ps, qs, copy=True)
pmf_prior.normalize()

Out[7]:

200.0

Now we can use binom to compute the probability of the data for each possible batting average in qs.

In [8]:

from scipy.stats import binom

k = 3
n = 3
likelihood = binom.pmf(k, n, qs)

To compute the posterior distribution, we multiply the prior by the likelihood and normalize again so the sum of the posterior Pmf is 1.

In [9]:

pmf_posterior = pmf_prior * likelihood
pmf_posterior.normalize()

Out[9]:

0.020006971070059262

The result is a discrete approximation of the actual posterior distribution — so let’s see how close it is.

Because the beta distribution is the conjugate prior of the binomial likelihood function, the posterior is also a beta distribution, with parameters updated to reflect the data: 3 successes and 0 failures.

In [10]:

beta_posterior = beta(a+3, b)
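As a sanity check on the conjugate update, here is a small sketch that redoes the grid computation with plain Python lists and compares the posterior mean with the closed-form mean of a beta distribution, a / (a + b). It assumes the same prior parameters and data as above; the variable names are mine.

```python
a, b = 40, 110   # prior parameters from the example
k, n = 3, 3      # 3 hits in 3 attempts

# Conjugate update: add successes to a and failures to b.
a_post, b_post = a + k, b + (n - k)
mean_exact = a_post / (a_post + b_post)

# Grid update: unnormalized beta prior times binomial likelihood.
# The binomial coefficient is a constant, so it drops out when we normalize.
qs = [i / 200 for i in range(201)]
prior = [q**(a - 1) * (1 - q)**(b - 1) for q in qs]
likelihood = [q**k * (1 - q)**(n - k) for q in qs]
unnorm = [p * like for p, like in zip(prior, likelihood)]

# Normalize so the masses add up to 1, then compute the mean.
total = sum(unnorm)
posterior = [u / total for u in unnorm]
mean_grid = sum(q * p for q, p in zip(qs, posterior))

print(mean_exact, mean_grid)
```

If the grid approximation is working, the two means should agree to many decimal places.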

Here’s what this theoretical posterior distribution looks like compared to our numerical approximation.

In [11]:

ps = beta_posterior.pdf(qs)
plt.plot(qs, ps, color='gray', label='beta')
pmf_posterior.plot(label='posterior')

decorate(xlabel='Batting average', ylabel='PDF/PMF')

Oops! Something has gone wrong. I assume this is what OP meant by “the pmf doesn’t resemble the pdf at all”.

The problem is that the PMF is normalized so the total of the probability masses is 1, and the PDF is normalized so the area under the curve is 1. They are not on the same scale, so the y-axis in this figure is different for the two curves.

To fix the problem, we can find the area under the PMF.

In [12]:

area = np.trapz(pmf_posterior.ps, pmf_posterior.qs)
area

Out[12]:

0.004999999999999999

And divide through to create a discrete approximation of the posterior PDF.

In [13]:

pdf_posterior = pmf_posterior / area

Now we can compare density to density.

In [14]:

ps = beta_posterior.pdf(qs)
plt.plot(qs, ps, color='gray', label='beta')
pdf_posterior.plot(label='posterior')

decorate(xlabel='Batting average', ylabel='PDF')

The curves are visually indistinguishable, and the numerical differences are small.

In [15]:

diff = ps - pdf_posterior.ps
np.max(np.abs(diff))

Out[15]:

4.973799150320701e-14

As an aside, note that the posterior and prior distributions are not very different. The prior mean is 0.267 and the posterior mean is 0.281. If a rookie goes 3 for 3 in their first game, that’s a good start, but it doesn’t mean they are the next Ted Williams.

In [16]:

pmf_prior.mean(), pmf_posterior.mean()

Out[16]:

(0.2666666666666667, 0.28104575163398693)

As an alternative to comparing PDFs, we could convert the PMF to a CDF, which contains the cumulative sum of the probability masses.

In [17]:

cdf_posterior = pmf_posterior.make_cdf()

And compare to the mathematical CDF of the beta distribution.

In [18]:

ps = beta_posterior.cdf(qs)
plt.plot(qs, ps, color='gray', label='beta')

cdf_posterior.step(label='posterior')

decorate(xlabel='Batting average', ylabel='CDF')

In this example, I plotted the discrete approximation of the CDF as a step function — which is technically what it is — to highlight how it overlaps with the beta distribution.

Discussion¶

Probability density is hard to define — the best I can do is usually, “It’s something that you can integrate to get a probability mass”. And when you work with probability densities computationally, you often have to discretize them, which can add another layer of confusion.

One way to distinguish PMFs and PDFs is to remember that PMFs are normalized so the sum of the probability masses is 1, and PDFs are normalized so the area under the curve is 1. In general, you can’t plot PMFs and PDFs on the same axes because they are not in the same units.
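To see the units issue in isolation, here is a sketch with a simple density, f(x) = 2x on [0, 1], which I chose only because its area is easy to check by hand. Discretizing it on a grid and normalizing gives masses that sum to 1; dividing the masses by the grid spacing puts them back on the density scale, at least approximately.

```python
n = 201
qs = [i / (n - 1) for i in range(n)]
dx = qs[1] - qs[0]

# Density values on the grid; the area under f(x) = 2x on [0, 1] is 1.
density = [2 * q for q in qs]

# Normalize to get a discrete approximation with total mass 1.
total = sum(density)
masses = [d / total for d in density]

# Dividing each mass by the spacing recovers (approximately) the density.
recovered = [m / dx for m in masses]

print(max(masses))      # on the order of dx, far below the density
print(max(recovered))   # close to the largest density value, 2
```

The masses depend on how fine the grid is, and the densities do not, which is why the two kinds of curve only line up after rescaling.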


Regrets and Regression

July 1, 2024 AllenDowney



Standardization and Normalization¶

Here’s a question from the Reddit statistics forum.

I want to write a research article that has regression analysis in it. I normalized my independent variables and want to include in the article the results of a statistical test showing that all variables are normal. I normalized using the scale function in R and some custom normalization functions I found online but whatever I do the new data fails the Shapiro Wilkinson and KS test on some columns? What to do?

There might be a few points of confusion here:

One is the idea that the independent variables in a regression model have to follow a normal distribution. This is a common misconception. Ordinary least squares regression assumes that the residuals follow a normal distribution — the dependent and independent variables don’t have to.
Another is the idea that statistical tests are useful for checking whether a variable follows a normal distribution. As I’ve explained in other articles, these tests don’t do anything useful.
The question might also reveal confusion about what “normalization” means. In the context of regression models, it usually means scaling a variable so its range is between 0 and 1. It sounds like OP might be using it to mean transforming a variable so it follows a normal distribution.

In general, transforming an independent variable to normal is not a good idea. It has no benefit, and it changes the estimated parameters for no justified reason. Let me explain.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Happiness¶

As an example, I’ll use data from the World Happiness Report, which uses survey data from 153 countries to explore the relationship between self-reported happiness and six potentially predictive factors:

Income as represented by per capita GDP.
Social support
Healthy life expectancy at birth
Freedom to make life choices
Generosity
Perceptions of corruption

The dependent variable, happiness, is the national average of responses to the “Cantril ladder question” used by the Gallup World Poll:

Please imagine a ladder with steps numbered from zero at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?

The data they used is available here.

In [3]:

download('https://happiness-report.s3.amazonaws.com/2020/WHR20_DataForFigure2.1.xls')

Out[3]:

'WHR20_DataForFigure2.1.xls'

We can read the data into a Pandas DataFrame.

In [4]:

df = pd.read_excel('WHR20_DataForFigure2.1.xls')
df.head()

Out[4]:

	Country name	Regional indicator	Ladder score	Standard error of ladder score	upperwhisker	lowerwhisker	Logged GDP per capita	Social support	Healthy life expectancy	Freedom to make life choices	Generosity	Perceptions of corruption	Ladder score in Dystopia	Explained by: Log GDP per capita	Explained by: Social support	Explained by: Healthy life expectancy	Explained by: Freedom to make life choices	Explained by: Generosity	Explained by: Perceptions of corruption	Dystopia + residual
0	Finland	Western Europe	7.8087	0.031156	7.869766	7.747634	10.639267	0.954330	71.900826	0.949172	-0.059482	0.195445	1.972317	1.285190	1.499526	0.961271	0.662317	0.159670	0.477857	2.762835
1	Denmark	Western Europe	7.6456	0.033492	7.711245	7.579955	10.774001	0.955991	72.402504	0.951444	0.066202	0.168489	1.972317	1.326949	1.503449	0.979333	0.665040	0.242793	0.495260	2.432741
2	Switzerland	Western Europe	7.5599	0.035014	7.628528	7.491272	10.979933	0.942847	74.102448	0.921337	0.105911	0.303728	1.972317	1.390774	1.472403	1.040533	0.628954	0.269056	0.407946	2.350267
3	Iceland	Western Europe	7.5045	0.059616	7.621347	7.387653	10.772559	0.974670	73.000000	0.948892	0.246944	0.711710	1.972317	1.326502	1.547567	1.000843	0.661981	0.362330	0.144541	2.460688
4	Norway	Western Europe	7.4880	0.034837	7.556281	7.419719	11.087804	0.952487	73.200783	0.955750	0.134533	0.263218	1.972317	1.424207	1.495173	1.008072	0.670201	0.287985	0.434101	2.168266

I’ll select the columns we’re interested in and give them shorter names.

In [5]:

data = pd.DataFrame()

data['ladder'] = df['Ladder score']
data['log_gdp'] = df['Logged GDP per capita']
data['social'] = df['Social support']
data['life_exp'] = df['Healthy life expectancy']
data['freedom'] = df['Freedom to make life choices']
data['generosity'] = df['Generosity']
data['corruption'] = df['Perceptions of corruption']

Notice that GDP is already on a log scale, which is generally a good idea. The other variables are in a variety of units and different scales.

Now, contrary to the advice OP was given, let’s run a regression model without even looking at the data.

In [6]:

import statsmodels.formula.api as smf

formula = 'ladder ~ log_gdp + social + life_exp + freedom + generosity + corruption'

results_raw = smf.ols(formula, data=data).fit()
results_raw.params

Out[6]:

Intercept    -2.059377
log_gdp       0.229079
social        2.723318
life_exp      0.035307
freedom       1.776815
generosity    0.410566
corruption   -0.628162
dtype: float64

Because the independent variables are on different scales, the estimated parameters are in different units. For example, life_exp is healthy life expectancy at birth in units of years, so the estimated parameter, 0.035, indicates that a difference of one year of life expectancy between countries is associated with an increase of 0.035 units on the Cantril ladder.

Now let’s check retroactively whether this dataset is consistent with the assumption that the residuals follow a normal distribution. I’ll use the following functions to compare the CDF of the residuals to a normal model with the same mean and standard deviation.

In [7]:

from scipy.stats import norm
from empiricaldist import Cdf

def make_normal_model(sample):
    """Make a Cdf of a normal distribution.
    
    sample: sequence of numbers
    
    returns: Cdf object
    """
    m, s = sample.mean(), sample.std()
    qs = np.linspace(m - 4 * s, m + 4 * s, 101)
    ps = norm.cdf(qs, m, s)
    return Cdf(ps, qs)

In [8]:

def plot_normal_model(sample, **options):
    """Plot the Cdf of a sample and a normal model.
    
    sample: sequence of numbers
    """
    cdf_model = make_normal_model(sample)
    cdf_data = Cdf.from_seq(sample)

    cdf_model.plot(color='gray')
    cdf_data.plot(**options)

In [9]:

plot_normal_model(results_raw.resid, label='residuals')
decorate(xlabel='residuals', ylabel='CDF')

The normal model fits the residuals well enough that the deviations are not a concern. This regression model is just fine with no transformations required.

But suppose we follow the advice OP received and check whether the dependent and independent variables follow normal distributions.

Looking for trouble¶

Here are the distributions of the variables compared to a normal model. Sometimes the model fits pretty well, sometimes not.

In [10]:

column = 'ladder'
plot_normal_model(data[column], label=column)
decorate(xlabel=column, ylabel='CDF')

In [11]:

column = 'log_gdp'
plot_normal_model(data[column], label=column)
decorate(xlabel=column, ylabel='CDF')

In [12]:

column = 'social'
plot_normal_model(data[column], label=column)
decorate(xlabel=column, ylabel='CDF')

In [13]:

column = 'life_exp'
plot_normal_model(data[column], label=column)
decorate(xlabel=column, ylabel='CDF')

In [14]:

column = 'freedom'
plot_normal_model(data[column], label=column)
decorate(xlabel=column, ylabel='CDF')

In [15]:

column = 'generosity'
plot_normal_model(data[column], label=column)
decorate(xlabel=column, ylabel='CDF')

In [16]:

column = 'corruption'
plot_normal_model(data[column], label=column)
decorate(xlabel=column, ylabel='CDF')

If we ran a statistical test, most of these would fail. But we don’t care, because regression models don’t require these variables to follow normal distributions. And when we make this kind of comparison across countries, there is no reason to expect them to.

Standardize¶

Nevertheless, there are reasons we might want to transform the independent variables before running a regression model — one is to quantify the “importance” of the different factors. We can do that by standardizing the variables, which means transforming them to have mean 0 and standard deviation 1.

In the untransformed variables, the means vary in magnitude from about 0.01 to 64.

In [17]:

data.mean().sort_values()

Out[17]:

generosity    -0.014568
corruption     0.733120
freedom        0.783360
social         0.808721
ladder         5.473240
log_gdp        9.295706
life_exp      64.445529
dtype: float64

And the standard deviations vary from about 0.1 to 7.

In [18]:

data.std().sort_values()

Out[18]:

freedom       0.117786
social        0.121453
generosity    0.151809
corruption    0.175172
ladder        1.112270
log_gdp       1.201588
life_exp      7.057848
dtype: float64

And, as I’ve already noted, they are expressed in different units. There is no meaningful way to compare the parameter of life_exp, which is steps of the Cantril ladder per year of life expectancy, with the parameter of corruption, which is in steps of the ladder per percentage point. But we can make these comparisons meaningful by standardizing the variables — that is, subtracting off the mean and dividing by the standard deviation.

In [19]:

standardized = (data - data.mean()) / data.std()

After transformation, the means are all close to 0.

In [20]:

standardized.mean()

Out[20]:

ladder       -2.786442e-16
log_gdp       6.966105e-16
social       -3.715256e-16
life_exp     -5.108477e-16
freedom      -3.250849e-16
generosity   -2.612289e-17
corruption   -2.554239e-16
dtype: float64

And the standard deviations are all close to 1.

In [21]:

standardized.std()

Out[21]:

ladder        1.0
log_gdp       1.0
social        1.0
life_exp      1.0
freedom       1.0
generosity    1.0
corruption    1.0
dtype: float64

Here’s the regression with the transformed variables.

In [22]:

results_standardized = smf.ols(formula, data=standardized).fit()
results_standardized.params

Out[22]:

Intercept    -4.072828e-16
log_gdp       2.474749e-01
social        2.973701e-01
life_exp      2.240379e-01
freedom       1.881597e-01
generosity    5.603632e-02
corruption   -9.892975e-02
dtype: float64

The intercept is close to 0, which is a consequence of how the math works out. And now the magnitudes of the parameters indicate how much a change in each independent variable, expressed as a multiple of its standard deviation, is associated with a change in the dependent variable, also as a multiple of its standard deviation.
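The standardized parameters are related to the raw ones by a simple rescaling: multiply each raw parameter by the standard deviation of its variable and divide by the standard deviation of the dependent variable. The sketch below checks that for three of the factors, using values copied by hand from the outputs above, so they are rounded.

```python
# Raw regression parameters, from results_raw.params
raw = {'log_gdp': 0.229079, 'social': 2.723318, 'life_exp': 0.035307}

# Standard deviations, from data.std()
std = {'ladder': 1.112270, 'log_gdp': 1.201588,
       'social': 0.121453, 'life_exp': 7.057848}

# Rescale each raw parameter into standard-deviation units.
rescaled = {name: raw[name] * std[name] / std['ladder'] for name in raw}

for name, value in rescaled.items():
    print(name, round(value, 4))
```

The results should match the corresponding entries of results_standardized.params to about four decimal places, which suggests that standardizing changes the units of the parameters but not the fit of the model.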

By this metric, social is the most important factor, followed closely by log_gdp and life_exp. The least important factor is generosity, at least as it is quantified by this variable.

Normalizing¶

An alternative to standardization is normalization, which transforms a variable so its range is between 0 and 1. To be honest, I’m not sure what the point of normalization is — it makes the results a little harder to interpret, and it doesn’t have any advantages I can think of.

But just for completeness, here’s how it’s done.

In [23]:

normalized = (data - data.min()) / (data.max() - data.min())

As intended, the normalized variables have the same range.

In [24]:

normalized.min()

Out[24]:

ladder        0.0
log_gdp       0.0
social        0.0
life_exp      0.0
freedom       0.0
generosity    0.0
corruption    0.0
dtype: float64

In [25]:

normalized.max()

Out[25]:

ladder        1.0
log_gdp       1.0
social        1.0
life_exp      1.0
freedom       1.0
generosity    1.0
corruption    1.0
dtype: float64

But as a result they have somewhat different means and standard deviations.

In [26]:

normalized.mean()

Out[26]:

ladder        0.554455
log_gdp       0.565357
social        0.746725
life_exp      0.608947
freedom       0.668690
generosity    0.332345
corruption    0.754826
dtype: float64

In [27]:

normalized.std()

Out[27]:

ladder        0.212192
log_gdp       0.242351
social        0.185365
life_exp      0.223317
freedom       0.203633
generosity    0.176200
corruption    0.212124
dtype: float64

Here’s the regression with normalized variables.

In [28]:

results_normalized = smf.ols(formula, data=normalized).fit()
results_normalized.params

Out[28]:

Intercept    -0.030706
log_gdp       0.216678
social        0.340407
life_exp      0.212877
freedom       0.196069
generosity    0.067483
corruption   -0.098962
dtype: float64

We’ll look more closely at those results soon, but first let’s look at one more preprocessing option, transforming values to follow a normal distribution.

Transform to normal¶

Here’s how we transform values to a normal distribution.

In [29]:

n = len(data)
ps = np.arange(1, n+1) / (n + 1)
zs = norm.ppf(ps)

transformed = data.copy()
for column in transformed.columns:
    transformed.sort_values(column, inplace=True)
    transformed[column] = zs

The ps are the cumulative probabilities; the zs are the corresponding quantiles in a standard normal distribution (ppf computes the “percent point function” which is an unnecessary name for the inverse CDF). Inside the loop, we sort transformed by one of the columns and then assign the zs in that order.
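
Here is the same idea on a tiny, made-up list, as a sketch that uses only the standard library (NormalDist().inv_cdf plays the role of norm.ppf). Each value is replaced by the normal quantile that matches its rank, so the order of the values is preserved but the spacing between them is not.

```python
from statistics import NormalDist

values = [3.2, 0.5, 9.9, 4.1, 1.7]   # made-up numbers
n = len(values)

# Cumulative probabilities 1/(n+1), ..., n/(n+1), as in the cell above
ps = [(i + 1) / (n + 1) for i in range(n)]
zs = [NormalDist().inv_cdf(p) for p in ps]

# Give the smallest value the smallest z, the next smallest the next z, ...
order = sorted(range(n), key=lambda i: values[i])
transformed = [0.0] * n
for rank, i in enumerate(order):
    transformed[i] = zs[rank]

for v, z in zip(values, transformed):
    print(v, round(z, 3))
```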

The transformed means are all close to 0.

In [30]:

transformed.mean()

Out[30]:

ladder        4.644070e-17
log_gdp       4.644070e-17
social        1.161018e-17
life_exp      0.000000e+00
freedom       0.000000e+00
generosity    4.644070e-17
corruption   -4.644070e-17
dtype: float64

And the standard deviations are close to 1 (but not exact because the tails of the distribution are cut off).

In [31]:

transformed.std()

Out[31]:

ladder        0.975034
log_gdp       0.975034
social        0.975034
life_exp      0.975034
freedom       0.975034
generosity    0.975034
corruption    0.975034
dtype: float64

As intended, the transformed variables follow a normal distribution.

In [32]:

plot_normal_model(transformed['corruption'], label='corruption')
decorate(ylabel='CDF')

Here’s the regression model with the transformed values.

In [33]:

results_transformed = smf.ols(formula, data=transformed).fit()
results_transformed.params

Out[33]:

Intercept     5.425248e-18
log_gdp       2.382770e-01
social        3.642517e-01
life_exp      1.951131e-01
freedom       2.028179e-01
generosity    3.216042e-02
corruption   -4.517596e-02
dtype: float64

Now let’s compare the estimated parameters with the three transformations.

In [34]:

results_standardized.params.plot(style='o', label='standardized')
results_normalized.params.plot(style='x', label='normalized')
results_transformed.params.plot(style='+', label='transformed')

decorate(ylabel='Estimated parameter')

The results with standardized variables are the easiest to interpret, so in that sense they are “right”. The other transformations have no advantages over standardization, and they produce somewhat different parameters.

Discussion¶

Regression does not require the dependent or independent variables to follow a normal distribution. It only assumes that the distribution of residuals is approximately normal, and even that requirement is not strict — regression is quite robust.
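
As a quick illustration, here is a sketch with simulated data where the predictor is strongly skewed (and so is the outcome), but the noise around the line is normal. The distribution, the coefficients, and the sample size are all arbitrary choices of mine. The fitted line should still recover the slope and intercept used to generate the data.

```python
import numpy as np

rng = np.random.default_rng(2)

# A strongly skewed predictor (exponential), which is nowhere near normal
x = rng.exponential(scale=2.0, size=2000)

# The outcome is a line plus normal noise, so y is skewed too
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=2000)

slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)   # should land near 0.5 and 1.0
```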

You do not need to check whether any of these variables follows a normal distribution, and you definitely should not use a statistical test.

As part of your usual exploratory data analysis, you should look at the distribution of all variables, but that’s primarily to detect any anomalies, outliers, or really extreme distributions — not to check whether they fit normal distributions.

If you want to use the parameters of the model to quantify the importance of the independent variables, use standardization. There’s no reason to use normalization, and definitely no reason to transform to a normal distribution.

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

Have the Nones Leveled Off?

June 29, 2024 AllenDowney

Last month Ryan Burge published “The Nones Have Hit a Ceiling“, using data from the 2023 Cooperative Election Study to show that the increase in the number of Americans with no religious affiliation has hit a plateau. Comparing the number of Atheists, Agnostics, and “Nothing in Particular” between 2020 and 2023, he found that “the share of non-religious Americans has stopped rising in any meaningful way.”

When I read that, I was frustrated that the HERI Freshman Survey had not published new data since 2019. I’ve been following the rise of the “Nones” in that dataset since one of my first blog articles.

As you might guess, the Freshman Survey reports data from incoming college students. Of course, college students are not a representative sample of the U.S. population, and as rates of college attendance have increased, they represent a different slice of the population over time. Nevertheless, surveying young adults over a long interval provides an early view of trends in the general population.

Well, I have good news! I got a notification today that HERI has published data tables for the 2020 through 2023 surveys. They are in PDF, so I had to do some manual data entry, but I have results!

Religious preference

Among other questions, the Freshman Survey asks students to select their “current religious preference” from a list of seventeen common religions, “Other religion,” “Atheist”, “Agnostic”, or “None.”

The options “Atheist” and “Agnostic” were added in 2015. For consistency over time, I compare the “Nones” from previous years with the sum of “None”, “Atheist” and “Agnostic” since 2015.

The following figure shows the fraction of Nones from 1969, when the question was added, to 2023, the most recent data available.

The blue line shows data until 2015; the orange line shows data from 2015 through 2019. The gray line shows a quadratic fit. The light gray region shows a 95% predictive interval.

The quadratic model continues to fit the data well and the recent trend is still increasing, but if you look at only the last few data points, there is some evidence that the rate of increase is slowing.

But not for women

Now here’s where things get interesting. Until recently, female students have been consistently more religious than male students. But that might be changing. The following figure shows the percentages of Nones for male and female students (with a missing point in 2018, when this breakdown was not available).

Since 2019, the percentage of Nones has increased for women and decreased for men, and it looks like women may now be less religious. So the apparent slowdown in the overall trend might be a mix of opposite trends in the two groups.

The following graph shows the gender gap over time, that is, the difference in percentages of male and female students with no religious affiliation.

The gap was essentially unchanged from 1990 to 2020. But in the last three years it has changed drastically. It now falls outside the predictive range based on past data, which suggests a change this large would be unlikely by chance.

Similarly with attendance at religious services, the gender gap has closed and possibly reversed.

UPDATE: Ryan Burge looked at the gender gap in CES and GSS data and found similar results: especially among young people, the gender gap has either disappeared or crossed over. And Ryan pointed me to this article by Dan Cox and Kelsey Eyre Hammond which reports similar trends in data from the Survey Center on American Life.

Attendance

The survey also asks students how often they “attended a religious service” in the last year. The choices are “Frequently,” “Occasionally,” and “Not at all.” Respondents are instructed to select “Occasionally” if they attended one or more times, so a wedding or a funeral would do it.

The following figure shows the fraction of students who reported any religious attendance in the last year, starting in 1968. I discarded a data point from 1966 that seems unlikely to be correct.

There is a clear dip in 2021, likely due to the pandemic, but the last two data points have returned to the long-term trend.

Data Source

The data reported here are available from the HERI publications page. Since I entered the data manually from PDF documents, it’s possible I have made errors.

Should divorce be more difficult?

June 14, 2024 AllenDowney

“The Christian right is coming for divorce next,” according to this recent Vox article, and “Some conservatives want to make it a lot harder to dissolve a marriage.”

As always when I read an article like this, I want to see data — and the General Social Survey has just the data I need. Since 1974, they have asked a representative sample of the U.S. population, “Should divorce in this country be easier or more difficult to obtain than it is now?” with the options to respond “Easier”, “More difficult”, or “Stay as is”.

Here’s how the responses have changed over time:

Since the 1990s, the percentage saying divorce should be more difficult has dropped from about 50% to about 30%. [The last data point, in 2022, may not be reliable. Due to disruptions during the COVID pandemic, the GSS changed some elements of their survey process — in the 2021 and 2022 data, responses to several questions have deviated from long-term trends in ways that might not reflect real changes in opinion.]

If we break down the results by political alignment, we can see whether these changes are driven by liberals, conservatives, or both.

Not surprisingly, conservatives are more likely than liberals to believe that divorce should be more difficult, by a margin of about 20 percentage points. But the percentages have declined in all groups — and fallen below 50% even among self-described conservatives.

As the Vox article documents, conservatives in several states have proposed legislation to make divorce more difficult. Based on the data, these proposals are likely to be unpopular.

To see my analysis, you can run this notebook on Colab. For similar analysis of other topics, see Chapter 11 of Probably Overthinking It.

Which Standard Deviation?

June 8, 2024 AllenDowney

It’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

standard_dev

Which Standard Deviation?¶

Here’s a question from the Reddit statistics forum.

When do we use N and when N-1 for computation of sd and cov?

So I was doing a task, where I had a portfolio of shares and I was given their yields. Then I had to calculate [covariance], [standard deviation] etc. But I simply do not know when to use N and when N-1. I only know that it has to do something with degrees of freedom. Can someone explain it to me like to a 10 year old? Thanks!

If you look up the formula for standard deviation, you are likely to find two versions. One has the sample size, N, in the denominator; the other has N-1.

Sometimes the explanation of when you should use each of them is not clear. And to make it more confusing, some software packages compute the N version by default, and some the N-1 version.

Let’s see if we can straighten it all out.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Samples and Populations¶

Suppose we are given a sequence of numbers and asked to compute the standard deviation. As an example, we’ll use the sequence from 0 to 9.

In [3]:

N = 10
data = np.arange(N)
data

Out[3]:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

The first step is to compute the mean.

In [4]:

m = np.mean(data)
m

Out[4]:

4.5

The deviations are the distances of each point from the mean.

In [5]:

deviations = data - m
deviations

Out[5]:

array([-4.5, -3.5, -2.5, -1.5, -0.5,  0.5,  1.5,  2.5,  3.5,  4.5])

Next we compute the sum of the squared deviations.

In [6]:

ssd = np.sum(deviations **2)
ssd

Out[6]:

82.5

Then we compute the variance — we’ll start with the version that has N in the denominator.

In [7]:

var = ssd / N

Finally, the standard deviation is the square root of variance.

In [8]:

std = np.sqrt(var)
std

Out[8]:

2.8722813232690143

And here’s the version with N-1 in the denominator.

In [9]:

var = ssd / (N-1)
std = np.sqrt(var)
std

Out[9]:

3.0276503540974917

With N=10, the difference between the versions is non-negligible, but this is pretty much the worst case. With larger sample sizes, the difference is smaller, and with smaller sample sizes, you probably shouldn’t compute a standard deviation at all.

By default, NumPy computes the N version.

In [10]:

np.std(data)

Out[10]:

2.8722813232690143

But with the optional argument ddof=1, it computes the N-1 version.

In [11]:

np.std(data, ddof=1)

Out[11]:

3.0276503540974917

By default, Pandas computes the N-1 version.

In [12]:

pd.Series(data).std()

Out[12]:

3.0276503540974917

But with the optional argument ddof=0, it computes the N version.

In [13]:

pd.Series(data).std(ddof=0)

Out[13]:

2.8722813232690143

It is not ideal that the two libraries have different default behavior. And it might not be obvious why the parameter that controls this behavior is called ddof.

The answer is related to OP’s question about “degrees of freedom”. To understand that term, suppose I ask you to think of three numbers. You are free to choose any first number, any second number, and any third number. In that case, you have three degrees of freedom.

Now suppose I ask you to think of three numbers, but they are required to add up to 10. You are free to choose the first and second numbers, but then the third number is determined by the requirement. So you have only two degrees of freedom.

If we are given a dataset with N elements, we generally assume that it has N degrees of freedom unless we are told otherwise. That’s what the ddof parameter does. It stands for “delta degrees of freedom”, where “delta” indicates a change or a difference. In this case, it is the difference between the presumed degrees of freedom, N, and the degrees of freedom that should be used for the computation, N - ddof.

So, with ddof=0, the denominator is N. With ddof=1, the denominator is N-1.
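
To tie this back to the step-by-step computation, here is a small sketch of a function with its own ddof parameter. The name std_ddof is mine, not part of any library. For comparison, the standard library’s statistics module sidesteps the parameter by providing two functions: pstdev for the N version and stdev for the N-1 version.

```python
import math
import statistics

def std_ddof(values, ddof=0):
    """Standard deviation with N - ddof in the denominator."""
    n = len(values)
    mean = sum(values) / n
    ssd = sum((v - mean) ** 2 for v in values)
    return math.sqrt(ssd / (n - ddof))

data = list(range(10))

print(std_ddof(data, ddof=0), statistics.pstdev(data))  # N version
print(std_ddof(data, ddof=1), statistics.stdev(data))   # N-1 version
```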

Which one is right?¶

So when should you use one or the other? It depends on whether you are describing data or making a statistical inference.

The N version is a descriptive statistic: it quantifies the variability of the data.
The N-1 version is an estimator — if data is a sample from a population, we can use it to infer the standard deviation of the population.

To be more specific, the N-1 version is an almost unbiased estimator, which means that it gets the answer almost right, on average, in the long run. Let’s see how that works.

Consider a uniform distribution from 0 to 10.

In [14]:

from scipy.stats import uniform

dist = uniform(0, 10)

Here are the actual mean and standard deviation of this distribution, computed analytically.

In [15]:

dist.mean(), dist.std()

Out[15]:

(5.0, 2.8867513459481287)

If we generate a large random sample from this distribution, we expect its standard deviation to be close to 2.9.

In [16]:

sample = dist.rvs(size=10000)
sample.std(ddof=0), sample.std(ddof=1)

Out[16]:

(2.8816683774305085, 2.881812471656537)

It is, and with a large sample size, the difference between the N version and the N-1 version is negligible.

But let’s see what happens if we generate small samples many times. First we’ll compute the N version of the standard deviation for 10001 samples.

In [17]:

standard_deviations_N = [dist.rvs(N).std(ddof=0) for i in range(10001)]

Ideally, the mean of these estimates should be close to the actual standard deviation of the distribution, which is about 2.9.

In [18]:

np.mean(standard_deviations_N)

Out[18]:

2.6966272770954323

But it’s not — it is too low on average, which means it is a biased estimator.

Let’s see if the N-1 version is better.

In [19]:

standard_deviations_Nminus1 = [dist.rvs(N).std(ddof=1) for i in range(10001)]

np.mean(standard_deviations_Nminus1)

Out[19]:

2.8482429099753923

Yes! The N-1 version is a much less biased estimator of the standard deviation.
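
Why "much less biased" rather than "unbiased"? The N-1 correction makes the variance estimate unbiased, but taking the square root reintroduces a small downward bias. The following sketch checks both claims with the same kind of simulation; the seed and the number of samples are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

true_var = 10**2 / 12            # variance of a uniform distribution on [0, 10]
true_std = np.sqrt(true_var)

# 10,000 samples of size 10, one sample per row
samples = rng.uniform(0, 10, size=(10_000, 10))

mean_var = samples.var(axis=1, ddof=1).mean()
mean_std = samples.std(axis=1, ddof=1).mean()

print(mean_var, true_var)   # close: the N-1 variance is unbiased
print(mean_std, true_std)   # the N-1 standard deviation still runs a little low
```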

That’s why the N-1 version is called the “sample standard deviation”, because it is appropriate when we are using a sample to estimate the standard deviation of a population.

If, instead, we are able to measure an entire population, we can use the N version — which is why it is called the “population standard deviation”.

Should we care?¶

On one hand, it is impressive that such a simple correction yields a much better estimator. On the other hand, it almost never matters in practice.

If you have a large sample, the difference between the two versions is negligible.
If you have a small sample, you can’t make a precise estimate of the standard deviation anyway.

To demonstrate the second point, let’s look at the distributions of the estimates using the two versions.

In [20]:

sns.kdeplot(standard_deviations_N, label='N version')
sns.kdeplot(standard_deviations_Nminus1, label='N-1 version')

decorate(xlabel='Estimated standard deviation')

Compared to the variation in the estimates, the difference between the versions is small.

In summary, if your sample size is small and it is really important to avoid underestimating the standard deviation, use the N-1 correction. Otherwise it doesn’t matter.

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

What is a percentile rank?

June 1, 2024 AllenDowney

It’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

percentile_rank

What is a Percentile Rank?¶

Here’s a question from the Reddit statistics forum.

What’s the difference between cumulative frequency as a percentage and percentiles?

Example that got me super confused:

[Here’s the data provided by OP]:

In [1]:

import pandas as pd

columns = ['score', 'frequency', 'cumulative frequency', 'cumulative percentage']

data = [
    (8, 1, 1, 2),
    (9, 1, 2, 4),
    (10, 1, 3, 6),
    (11, 3, 6, 13),
    (12, 5, 11, 23),
    (13, 7, 18, 38),
    (14, 8, 26, 54),
    (15, 9, 35, 73),
    (16, 6, 41, 85),
    (17, 4, 45, 94),
    (18, 2, 47, 98),
    (19, 1, 48, 100),
]
df = pd.DataFrame(data, columns=columns)
df

Out[1]:

	score	frequency	cumulative frequency	cumulative percentage
0	8	1	1	2
1	9	1	2	4
2	10	1	3	6
3	11	3	6	13
4	12	5	11	23
5	13	7	18	38
6	14	8	26	54
7	15	9	35	73
8	16	6	41	85
9	17	4	45	94
10	18	2	47	98
11	19	1	48	100

OP continues:

To me, the last [column], the “cumulative frequency” expressed in percentages is a percentile: it’s the percentage of scores lower then or the same as the score it’s calculated for.

However, in my class it’s explained that the percentile for score 14 would be 46, and that the percentile for score 17 would be 90, not 94.

The way they arrive at this is that upper and lower limits for the percentile calculation have to be set, in the case of score 14 that would be 14 and 13, and then the percentile would be calculated as (upper limit+lower limit)/2.

In the case of 14: (38+54)/2 = 46

This makes absolutely no sense to me: why are we introducing arbitrary limits to average out when the cumulative frequency in percentages, to me, seems to meet the definition of a percentile perfectly?

What’s at issue here is the definition of percentile and percentile rank. As an example, let’s consider the distribution of height, and suppose the median is 168 cm. In that case, 168 cm is the 50th percentile, and if someone is 168 cm tall, their percentile rank is 50%.

By this definition, the percentile rank for a particular quantity is its cumulative frequency expressed as a percentage, exactly as OP suggests. If you are taller than, or the same height as, 50% of the population, your percentile rank is 50%.

However, some classes teach the alternative definition OP presents. So, let me explain the two definitions, and we can discuss the pros and cons. At the risk of giving away the ending, here is my conclusion:

Percentiles and percentile ranks are perfectly well defined, and I don’t think we should encourage variations on the definitions.
In a dataset with many repeated values, percentile ranks might not measure what we want — in that case we might want to compute a different statistic, but then we should give it a different name.

In other words, I agree with OP.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [2]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [3]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Percentiles and percentile ranks¶

From the dataset OP provided, we can extract the scores and their frequencies.

In [4]:

pairs = df[['frequency', 'score']].values

And we can reconstitute the sample by making the given number of copies of each score.

In [5]:

sample = np.concatenate([[score] * freq for freq, score in pairs])
sample

Out[5]:

array([ 8,  9, 10, 11, 11, 11, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13,
       13, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15,
       15, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 18, 18, 19])

Now suppose we want to compute the median score and the interquartile range (IQR). We can use the NumPy function percentile to compute the 25th, 50th, and 75th percentiles.

In [6]:

np.percentile(sample, [25, 50, 75])

Out[6]:

array([13., 14., 16.])

So the median score is 14 and the IQR is 16 - 13 = 3.

Going in the other direction, if we are given a score, q, we can compute its percentile rank by computing the fraction of scores less than or equal to q, expressed as a percentage.

In [7]:

q = 14
np.mean(sample <= q) * 100

Out[7]:

54.166666666666664

If someone gets the median score of 14, their percentile rank is about 54%.

Here we see the first thing that bothers people about percentiles and percentile ranks: with a finite dataset, they are not invertible. If you compute the percentile rank of the 50th percentile, the answer is not always 50%. To see why, let’s consider the CDF.
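
Here is a sketch of that round trip using only the standard library. The two helper functions are mine, and percentile uses a simple "smallest score whose rank reaches p" rule, which is not the interpolation np.percentile applies by default, although it gives the same three scores for this sample.

```python
# Scores and frequencies from OP's table
freqs = {8: 1, 9: 1, 10: 1, 11: 3, 12: 5, 13: 7, 14: 8,
         15: 9, 16: 6, 17: 4, 18: 2, 19: 1}
sample = sorted(score for score, f in freqs.items() for _ in range(f))
n = len(sample)

def percentile_rank(q):
    """Percentage of scores less than or equal to q."""
    return 100 * sum(s <= q for s in sample) / n

def percentile(p):
    """Smallest score whose percentile rank is at least p."""
    return next(s for s in sample if percentile_rank(s) >= p)

# Go from a percentage to a score and back again
for p in [25, 50, 75]:
    q = percentile(p)
    print(p, q, round(percentile_rank(q), 1))
```

In each case the percentage we get back is larger than the one we started with, because many students share the same score.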

The CDF¶

Percentiles and percentile ranks are closely related to the cumulative distribution function (CDF). To demonstrate, we can use empiricaldist to make a Cdf object from the reconstituted sample.

In [8]:

from empiricaldist import Cdf

cdf = Cdf.from_seq(sample)
cdf

Out[8]:

	probs
8	0.020833
9	0.041667
10	0.062500
11	0.125000
12	0.229167
13	0.375000
14	0.541667
15	0.729167
16	0.854167
17	0.937500
18	0.979167
19	1.000000

A Cdf object is a Pandas Series that contains the observed quantities as an index and their cumulative probabilities as values. We can use square brackets to look up a quantity and get a cumulative probability.

In [9]:

cdf[14]

Out[9]:

0.5416666666666666

If we multiply by 100, we get the percentile rank.

But square brackets only work with quantities in the dataset. If we look up any other quantity, that’s an error. However, we can use parentheses to call the Cdf object like a function.

In [10]:

cdf(14)

Out[10]:

array(0.54166667)

And that works with any numerical quantity.

In [11]:

cdf(13.5)

Out[11]:

array(0.375)
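
The result for 13.5 is the same as the result for 13 because no scores fall between them. To show the logic without the library, here is a minimal stand-in for the function call. It is a sketch, not how empiricaldist is implemented: it just counts the scores at or below x with a binary search.

```python
from bisect import bisect_right

# Scores and frequencies from OP's table
freqs = {8: 1, 9: 1, 10: 1, 11: 3, 12: 5, 13: 7, 14: 8,
         15: 9, 16: 6, 17: 4, 18: 2, 19: 1}
sample = sorted(score for score, f in freqs.items() for _ in range(f))

def step_cdf(x):
    """Fraction of the sample less than or equal to x."""
    return bisect_right(sample, x) / len(sample)

print(step_cdf(13), step_cdf(13.5), step_cdf(14))
print(step_cdf(7), step_cdf(20))   # below and above every score
```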

Cdf provides a step method that plots the CDF as a step function, which is what it technically is.

In [12]:

cdf.step(label='')
decorate(ylabel='CDF')

To be more explicit, we can put markers at the top of each vertical segment to indicate how the function is evaluated at one of the observed quantities.

In [13]:

cdf.step(label='')
cdf.plot(style='o', label='')
decorate(ylabel='CDF')

The Cdf object provides a method called inverse that computes the inverse CDF — that is, if you give it a cumulative probability, it computes the corresponding quantity.

In [14]:

p1 = 0.375
cdf.inverse(p1)

Out[14]:

array(13.)

The result from an inverse lookup is sometimes called a “quantile”, which is the name of a Pandas method that computes quantiles.

If we put the sample into a Series, we can use quantile to look up the cumulative probability p1 and get the corresponding quantity.

In [15]:

sample_series = pd.Series(sample)
sample_series.quantile(p1)

Out[15]:

13.625

By default, quantile interpolates linearly between the observed values. If we want this function to behave according to the definition of the CDF, we have to specify a different kind of interpolation.

In [16]:

sample_series.quantile(p1, interpolation='lower')

Out[16]:

13

If the result from the inverse CDF is called a quantile, you might wonder what we call the result from the CDF. By analogy with percentile and percentile rank, I think it should be called a quantile rank, but no one calls it that. As far as I can tell, it’s just called a cumulative probability.
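
As an aside, here is where the 13.625 from the default interpolation comes from. This is a sketch of the rule as I understand it, not the Pandas source: the cumulative probability is mapped to a fractional position (N-1)·p in the sorted sample, and the result is interpolated between the two neighboring values. The 'lower' option takes the value at the lower position instead.

```python
import math

# Scores and frequencies from OP's table
freqs = {8: 1, 9: 1, 10: 1, 11: 3, 12: 5, 13: 7, 14: 8,
         15: 9, 16: 6, 17: 4, 18: 2, 19: 1}
sample = sorted(score for score, f in freqs.items() for _ in range(f))

p = 0.375

# Fractional position in the sorted sample, counting from 0
pos = (len(sample) - 1) * p
lo, hi = math.floor(pos), math.ceil(pos)
frac = pos - lo

linear = sample[lo] + frac * (sample[hi] - sample[lo])
lower = sample[lo]

print(pos, linear, lower)
```

The position is 17.625, which falls between the last 13 and the first 14, so linear interpolation reports a score that no one actually got.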

So what’s wrong with percentile rank?¶

Percentiles and percentile ranks have a perfectly good definition, which follows from the definition of the CDF. So what’s the problem? Well, I have to admit — the definition is a little arbitrary.

In the example, suppose your score is 14. Your score is strictly better than 37.5% of the other scores.

In [17]:

less = np.mean(sample < 14)
less

Out[17]:

0.375

It’s strictly worse than 45.8%.

In [18]:

more = np.mean(sample > 14)
more

Out[18]:

0.4583333333333333

And equal to 16.7%.

In [19]:

same = np.mean(sample == 14)
same

Out[19]:

0.16666666666666666

So if we want a single number that quantifies your performance relative to the rest of the class, which number should we use?

The definition of the CDF suggests we should report the fraction of the class whose score is less than or equal to yours. But that is an arbitrary choice. We could just as easily report the fraction whose score is strictly less — or the midpoint of these extremes. For a score of 14, here’s the midpoint:

In [20]:

less + same / 2

Out[20]:

0.4583333333333333

That’s the result OP’s teacher was expecting, and to be fair, that’s Wikipedia’s definition of percentile rank. But I don’t like it.
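
To check both of the numbers from OP’s class, here is a sketch that computes all three candidates for a given score. The function name and the dictionary keys are mine.

```python
# Scores and frequencies from OP's table
freqs = {8: 1, 9: 1, 10: 1, 11: 3, 12: 5, 13: 7, 14: 8,
         15: 9, 16: 6, 17: 4, 18: 2, 19: 1}
sample = sorted(score for score, f in freqs.items() for _ in range(f))
n = len(sample)

def rank_summary(q):
    """Three ways to express the standing of score q, as percentages."""
    less = sum(s < q for s in sample) / n
    same = sum(s == q for s in sample) / n
    return {
        'strictly less': 100 * less,
        'less or equal': 100 * (less + same),
        'midpoint': 100 * (less + same / 2),
    }

for q in [14, 17]:
    print(q, {k: round(v, 1) for k, v in rank_summary(q).items()})
```

Rounded to whole percentages, the midpoints for 14 and 17 are 46 and 90, the values OP was given, and the "less or equal" value for 17 is 94, the value OP expected.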

Discussion¶

I prefer the CDF-based definition of percentile rank because it’s consistent with the way most computational tools work. The midpoint-based definition feels like a holdover from the days of working with small datasets by hand.

That’s just my preference — if people want to compute midpoints, I won’t stop them. But for the sake of clarity, we should give different names to different statistics.

Historically, I think the CDF-based definition has the stronger claim on the term “percentile rank”. For the midpoint-based definition, ChatGPT suggests “midpoint percentile rank” or “average percentile rank”. Those seem fine to me, but it doesn’t look like they are widely used.

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

Logarithms and Heteroskedasticity

May 26, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

log_heterosked

Logarithms and heteroskedasticity¶

Here’s a question from the Reddit statistics forum.

Is it correct to use logarithmic transformation in order to mitigate heteroskedasticity?

For my studies I gathered data on certain preferences across a group of people. I am trying to figure out if I can pinpoint preferences to factors such as gender in this case.

I used mixed ANOVA analysis with good success however one of my hypothesis came up with heteroskedasticity when doing Levene’s test. [I’ve been] breaking my head all day on how to solve this. I’ve now used logarithmic transformation to all 3 test results and run another Levene’s. When using the media value the test now results [in] homoskedasticity, however interaction is no longer significant?

Is this the correct way to deal with this problem or is there something I am missing? Thanks in advance to everyone taking their time to help.

Although the question is about ANOVA, I’m going to reframe it in terms of regression, for two reasons:

Discussion of heteroskedasticity is clearer in the context of regression.
For many problems, a regression model is better than ANOVA anyway.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

What is heteroskedasticity?¶

Linear regression is based on a model of the data-generating process where the dependent variable y is the sum of

A linear function of x with an unknown slope and intercept, and
Random values drawn from a Gaussian distribution with mean 0 and unknown standard deviation, sigma, that does not depend on x.

If, contrary to the second assumption, sigma depends on x, the data-generating process is heteroskedastic. Some amount of heteroskedasticity is common in real data, but most of the time it’s not a problem, because:

Heteroskedasticity doesn’t bias the estimated parameters of the model, only the standard errors,
Even very strong heteroskedasticity doesn’t affect the standard errors by much, and
For practical purposes we don’t need standard errors to be particularly precise.
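
The first point can be checked by simulation. The following is a sketch under assumptions of my own: the true line is y = x, sigma grows linearly from 0.1 to 6.0 across the range of x (numbers chosen to resemble the example that follows), and I fit the line with np.polyfit rather than StatsModels. If the estimates are unbiased, the average slope over many simulated datasets should be close to 1.

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_slope(n=200):
    """Fit a line to one heteroskedastic dataset and return the slope."""
    xs = rng.normal(30, 1, size=n)
    # sigma grows linearly from 0.1 at the smallest x to 6.0 at the largest
    sigmas = np.interp(xs, [xs.min(), xs.max()], [0.1, 6.0])
    ys = xs + rng.normal(0, sigmas)      # true slope is 1, true intercept is 0
    slope, _ = np.polyfit(xs, ys, 1)
    return slope

slopes = np.array([simulate_slope() for _ in range(1000)])
print(slopes.mean())   # should be close to the true slope, 1
print(slopes.std())    # the spread of individual estimates
```

Any single estimate can be well off the mark, but the errors are not systematically high or low.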

To demonstrate, let’s generate some data. First, we’ll draw xs from a normal distribution.

In [3]:

np.random.seed(17)

n = 200
xs = np.random.normal(30, 1, size=n)
xs.sort()

To generate heteroskedastic data, I’ll use interpolate to construct a function where sigma depends on x.

In [4]:

from scipy.interpolate import interp1d

def interpolate(xs, sigma_seq):
    return interp1d([xs.min(), xs.max()], sigma_seq)(xs)

To generate strong heteroskedasticity, I’ll vary sigma over a wide range.

In [5]:

sigmas = interpolate(xs, [0.1, 6.0])
np.mean(sigmas)

Out[5]:

3.126391153924031

Here’s what sigma looks like as a function of x.

In [6]:

plt.plot(xs, sigmas, '.')

decorate(xlabel='x',
         ylabel='sigma')

Now we can generate ys with variable values of sigma.

In [7]:

ys = xs + np.random.normal(0, sigmas)

If we make a scatter plot of the data, we see a cone shape that indicates heteroskedasticity.

In [8]:

plt.plot(xs, ys, '.')

decorate(xlabel='x', ylabel='y')

Now let’s fit a model to the data.

In [9]:

import statsmodels.api as sm

X = sm.add_constant(xs)
ols_model = sm.OLS(ys, X)
ols_results = ols_model.fit()

intercept, slope = ols_results.params
intercept, slope

Out[9]:

(0.7580177696902339, 0.9672433174107101)

Here’s what the fitted line looks like.

In [10]:

fys = intercept + slope * xs

plt.plot(xs, ys, '.')
plt.plot(xs, fys)

decorate(xlabel='x', ylabel='y')

If we plot the absolute values of the residuals, we can see the heteroskedasticity more clearly.

In [11]:

resid = ys - fys
plt.plot(xs, np.abs(resid), '.')

decorate(xlabel='x', ylabel='absolute residual')

Testing for heteroskedasticity

OP mentions using the Levene test for heteroskedasticity, which is used to test whether sigma is different between groups. For continuous values of x and y, we can use the Breusch-Pagan Lagrange Multiplier test:

In [12]:

from statsmodels.stats.diagnostic import het_breuschpagan

_, p_value, _, _ = het_breuschpagan(resid, ols_model.exog)
p_value

Out[12]:

0.0006411829020109725

Or White’s Lagrange Multiplier test:

In [13]:

from statsmodels.stats.diagnostic import het_white

_, p_value, _, _ = het_white(resid, ols_model.exog)
p_value

Out[13]:

0.0019263142806157931

Both tests produce small p-values, which means that if we generate a dataset by a homoskedastic process, there is almost no chance it would have as much heteroskedasticity as the dataset we generated.

If you have never heard of either of these tests, don’t panic: neither had I until I looked them up for this example. And don’t worry about remembering them, because you should never use them again. Like testing for normality, testing for heteroskedasticity is never useful.

Why? Because in almost any real dataset, you will find some heteroskedasticity. So if you test for it, there are only two possible results:

If the heteroskedasticity is small and you don’t have much data, you will fail to reject the null hypothesis.
If the heteroskedasticity is large or you have a lot of data, you will reject the null hypothesis.

Either way, you learn nothing — and in particular, you don’t learn the answer to the question you actually care about, which is whether the heteroskedasticity is so large that the effect on the standard errors is large enough that you should care.

And the answer to that question is almost always no.
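To illustrate how the test result depends on sample size, here is a rough sketch. It uses a hand-rolled n times R-squared version of the Breusch-Pagan test for a single explanatory variable, which I am assuming is close in spirit to what het_breuschpagan computes but not necessarily identical. The helper names and the mild 30% change in sigma are hypothetical choices for this illustration.

```python
import math
import numpy as np

def breusch_pagan_pvalue(x, y):
    """n * R^2 test with one explanatory variable (chi-square, 1 df)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sq_resid = (y - X @ beta) ** 2
    # auxiliary regression of the squared residuals on x
    gamma, *_ = np.linalg.lstsq(X, sq_resid, rcond=None)
    ss_res = np.sum((sq_resid - X @ gamma) ** 2)
    ss_tot = np.sum((sq_resid - sq_resid.mean()) ** 2)
    lm = max(len(x) * (1 - ss_res / ss_tot), 0.0)
    # survival function of a chi-square distribution with 1 df
    return math.erfc(math.sqrt(lm / 2))

def mild_data(n, rng):
    """Data where sigma changes by only 30% across the range of x."""
    x = rng.uniform(0, 1, size=n)
    sigma = 1 + 0.3 * x
    y = x + rng.normal(0, sigma)
    return x, y

rng = np.random.default_rng(3)
p_small = breusch_pagan_pvalue(*mild_data(30, rng))
p_large = breusch_pagan_pvalue(*mild_data(20_000, rng))
print(p_small, p_large)
```

The data-generating process is the same in both cases. With the large sample, the p-value should be tiny. With the small sample it will usually be unremarkable, although that depends on the random seed. Neither result says whether a 30% change in sigma matters.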

Should we care?

The dataset we generated has very large heteroskedasticity. Let’s see how much effect that has on the results. Here are the standard errors from simple linear regression:

In [14]:

ols_results.bse

Out[14]:

array([6.06159018, 0.20152957])

Now, there are several ways to generate standard errors that are robust in the presence of heteroskedasticity. One is the Huber-White estimator, which we can compute like this:

In [15]:

robust_se = ols_results.get_robustcov_results(cov_type='HC3')
robust_se.bse

Out[15]:

array([6.73913518, 0.2268012 ])
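In case it is not clear what the HC3 option computes, here is a minimal NumPy sketch of the usual "sandwich" formula, with squared residuals divided by (1 - leverage) squared. This is my own illustration, not the statsmodels implementation, and the function name ols_standard_errors is hypothetical.

```python
import numpy as np

def ols_standard_errors(x, y):
    """Return (classical, HC3) standard errors for intercept and slope."""
    X = np.column_stack([np.ones_like(x), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    n, p = X.shape

    # classical estimate: one shared sigma^2 times (X'X)^-1
    sigma2 = resid @ resid / (n - p)
    classical = np.sqrt(np.diag(sigma2 * XtX_inv))

    # HC3: each observation gets its own leverage-adjusted squared residual
    leverage = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    weights = (resid / (1 - leverage)) ** 2
    meat = X.T @ (X * weights[:, None])
    hc3 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return classical, hc3

# usage with data generated the same way as in this example
rng = np.random.default_rng(17)
x = np.sort(rng.normal(30, 1, size=200))
sigma = np.interp(x, [x.min(), x.max()], [0.1, 6.0])
y = x + rng.normal(0, sigma)
print(ols_standard_errors(x, y))
```

The classical estimate assumes one value of sigma for all observations; the robust version lets each residual speak for its own neighborhood of x.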

Another is to use Huber regression.

In [16]:

huber_model = sm.RLM(ys, X, M=sm.robust.norms.HuberT())
huber_results = huber_model.fit()
huber_results.bse

Out[16]:

array([5.92709031, 0.19705786])

Another is to use quantile regression.

In [17]:

quantile_model = sm.QuantReg(ys, X)
quantile_results = quantile_model.fit(q=0.5)
quantile_results.bse

Out[17]:

array([7.6449323 , 0.25417092])

And one more option is a wild bootstrap, which resamples the residuals by multiplying them by a random sequence of 1 and -1. This way of resampling preserves heteroskedasticity, because it only changes the sign of the residuals, not the magnitude, and it maintains the relationship between those magnitudes and x.

In [18]:

from scipy.stats import linregress

def wild_bootstrap():
    resampled = fys + ols_results.resid * np.random.choice([1, -1], size=n)
    res = linregress(xs, resampled)
    return res.intercept, res.slope

We can use wild_bootstrap to generate a sample from the sampling distributions of the intercept and slope.

In [19]:

sample = [wild_bootstrap() for i in range(1001)]

The standard deviation of the sampling distributions is the standard error.

In [20]:

bootstrap_bse = np.std(sample, axis=0)
bootstrap_bse

Out[20]:

array([6.63622313, 0.22341784])

Now let’s put all of the results in a table.

In [21]:

columns = ['SE(intercept)', 'SE(slope)']
index = ['OLS', 'Huber-White', 'Huber', 'quantile', 'bootstrap']
data = [ols_results.bse, robust_se.bse, huber_results.bse, 
        quantile_results.bse, bootstrap_bse]
df = pd.DataFrame(data, columns=columns, index=index)
df.sort_values(by='SE(slope)')

Out[21]:

	SE(intercept)	SE(slope)
Huber	5.927090	0.197058
OLS	6.061590	0.201530
bootstrap	6.636223	0.223418
Huber-White	6.739135	0.226801
quantile	7.644932	0.254171

The standard errors we get from different methods are notably different, but the differences probably don’t matter.

First, I am skeptical of the results from Huber regression. With this kind of heteroskedasticity, the standard errors should be larger than what we get from OLS. I’m not sure what the problem is, and I haven’t bothered to find out, because I don’t think Huber regression is necessary in the first place.

The results from bootstrapping and the Huber-White estimator are the most reliable — which suggests that the standard errors from quantile regression are too big.

In my opinion, we don’t need esoteric methods to deal with heteroskedasticity. If heteroskedasticity is extreme, consider using wild bootstrap. Otherwise, just use ordinary least squares.

Now let’s address OP’s headline question, “Is it correct to use logarithmic transformation in order to mitigate heteroskedasticity?”

Log transform help?

In some cases, a log transform can reduce or eliminate heteroskedasticity. However, there are several reasons this is not a good idea in general:

As we’ve seen, heteroskedasticity is not a big problem, so it usually doesn’t require any mitigation.
Taking a log transform of one or more variables in a regression model changes the meaning of the model — it hypothesizes a relationship between the variables that might not make sense in context.
Anyway, taking a log transform doesn’t always help.

To demonstrate the last point, let’s see what happens if we apply a log transform to the dependent variable:

In [22]:

log_ys = np.log10(ys)

Here’s what the scatter plot looks like after the transform.

In [23]:

plt.plot(xs, log_ys, '.')

decorate(xlabel='x', ylabel='log10 y')

Here’s what we get if we fit a model to the data.

In [24]:

ols_model_log = sm.OLS(log_ys, X)
ols_results_log = ols_model_log.fit()

intercept, slope = ols_results_log.params
intercept, slope

Out[24]:

(1.072059757295944, 0.013315487289434717)
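To see how the transform changes the meaning of the model, it helps to translate the slope back to the original scale. In the untransformed model, the slope is the number of units y changes per unit of x. In the log10 model, the slope is an exponent. The following sketch uses the slope reported above; the variable names are mine.

```python
# slope of the log10 model, copied from the output above
slope_log10 = 0.013315487289434717

# a one-unit increase in x multiplies the predicted y by this factor
factor = 10 ** slope_log10
percent_change = (factor - 1) * 100

print(f"factor per unit of x: {factor:.4f}")
print(f"percent change per unit of x: {percent_change:.1f}%")
```

So the transformed model hypothesizes a constant percentage change in y per unit of x, rather than a constant additive change. Whether that makes sense depends on the context.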

And here’s the fitted line.

In [25]:

log_fys = intercept + slope * xs

plt.plot(xs, log_ys, '.')
plt.plot(xs, log_fys)

decorate(xlabel='x', ylabel='log10 y')

If we plot the absolute values of the residuals, we can see that the log transform did not entirely remove the heteroskedasticity.

In [26]:

log_resid = log_ys - log_fys
plt.plot(xs, np.abs(log_resid), '.')

decorate(xlabel='x', ylabel='absolute residual on log y')

Which we can confirm by running the tests again (which we should never do).

In [27]:

_, p_value, _, _ = het_breuschpagan(log_resid, ols_model_log.exog)
p_value

Out[27]:

0.002154782205265498

In [28]:

_, p_value, _, _ = het_white(log_resid, ols_model_log.exog)
p_value

Out[28]:

0.006069762292696221

The p-values are bigger, which suggests that the log transform mitigated the heteroskedasticity a little. But if the goal was to eliminate heteroskedasticity, the log transform didn’t do it.

Discussion

To summarize:

Heteroskedasticity is common in real datasets — if you test for it, you will often find it, provided you have enough data.
Whether or not you find it, testing does not answer the question you really care about, which is whether the heteroskedasticity is extreme enough to be a problem.
Plain old linear regression is robust to heteroskedasticity, so unless it is really extreme, it is probably not a problem.
Even in the worst case, heteroskedasticity does not bias the estimated parameters — it only affects the standard errors — and we don’t need standard errors to be particularly precise anyway.
Although a log transform can sometimes mitigate heteroskedasticity, it doesn’t always succeed, and even if it does, it’s usually not necessary.
A log transform changes the meaning of the regression model in ways that might not make sense in context.

So, use a log transform if it makes sense in context, not to mitigate a problem that’s not much of a problem in the first place.

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

Combining Risks

May 24, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

Combining Risks

Here’s a question from the Reddit statistics forum.

Bit of a weird one but I’m hoping you’re the community to help. I work in children’s residential care and I’m trying to find a way of better matching potential young people together.

The way we calculate individual risk for a child is

risk = likelihood + impact (R=L+I), so

L4 + I5 = R9

That works well for individuals but I need to work out a good way of calculating a combined risk to place children [in] the home together. I’m currently using the [arithmetic] average but I don’t feel that it works properly as the average is always lower then the highest risk.

I’ll use a fairly light risk as an example, running away from the home. (We call this MFC missing from care) It’s fairly common that one of the kids will run away from the home at some point or another either out of boredom or frustration. If young person A has a risk of 9 and young person B has a risk of 12 the the average risk of MFC in the home would be 10.5

HOWEVER more often then not having two young people that go MFC will often result in more episodes as they will run off together, so having a lower risk rating doesn’t really make sense. Adding the two together to 21 doesn’t really work either though as the likelihood is the thing that increases not necessarily the impact.

I’m a lot better at chasing after run away kids then I am mathematics so please help 😂.

Here’s one way to think about this question: based on background knowledge and experience, OP has qualitative ideas about what happens when we put children at different risks together, and he is looking for a statistical summary that is consistent with these ideas.

The arithmetic mean probably makes sense as a starting point, but it clashes with the observation that if you put two children together who are high risk, they interact in ways that increase the risk. For example, if we put together children with risks 9 and 12, the average is 10.5, and OP’s domain knowledge says that’s too low — it should be more than 12.

At the other extreme, I’ll guess that putting together two low risk children might be beneficial to both — so the combined risk might be lower than either.

And that implies that there is a neutral point somewhere in the middle, where the combined risk is equal to the individual risks.

To construct a summary statistic like that, I suggest a weighted sum of the arithmetic and geometric means. That might sound strange, but I’ll show that it has the properties we want. And it might not be as strange as it sounds — there’s a reason it might make sense.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [18]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [19]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Weighted sum of means

The following function computes the arithmetic mean of a sequence of values, which is the sum divided by n.

In [20]:

def amean(xs):
    n = len(xs)
    return np.sum(xs) / n

The following function computes the geometric mean of a sequence, which is the product raised to the power 1/n.

In [21]:

def gmean(xs):
    n = len(xs)
    return np.prod(xs) ** (1/n)

And the following function computes the weighted sum of the arithmetic and geometric means. The constant k determines how much weight we give the geometric mean.

In [22]:

def mean_plus_gmean(*xs, k=1):
    return amean(xs) + k * (gmean(xs) - 4)

The value 4 determines the neutral point. So if we put together two people with risk 4, the combined average is 4.

In [23]:

mean_plus_gmean(4, 4)

Out[23]:

4.0

Above the neutral point, there is a penalty if we put together two children with higher risks.

In [24]:

mean_plus_gmean(5, 5)

Out[24]:

6.0

In that case, the combined risk is higher than the individual risks. Below the neutral point, there is a bonus if we put together two children with low risks.

In [25]:

mean_plus_gmean(3, 3)

Out[25]:

2.0

In that case, the combined risk is less than the individual risks.

If we combine low and high risks, the discrepancy brings the average down a little.

In [26]:

mean_plus_gmean(3, 5)

Out[26]:

3.872983346207417

In the example OP presented, where we put together two people with high risk, the penalty is substantial.

In [27]:

mean_plus_gmean(9, 12)

Out[27]:

16.892304845413264

If that penalty seems too high, we can adjust the weight, k, and the neutral point accordingly.
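To make those two knobs explicit, here is a sketch of a variation with the neutral point as a parameter. The name combined_risk and the neutral keyword are my additions for illustration; the defaults reproduce the function above. It computes the geometric mean with logarithms, so it assumes all scores are positive.

```python
import numpy as np

def combined_risk(*xs, k=1, neutral=4):
    """Arithmetic mean plus k times (geometric mean - neutral point)."""
    xs = np.asarray(xs, dtype=float)
    amean = xs.mean()
    gmean = np.exp(np.log(xs).mean())  # requires positive scores
    return amean + k * (gmean - neutral)

print(combined_risk(9, 12))                  # same as mean_plus_gmean(9, 12)
print(combined_risk(9, 12, k=0.5))           # smaller penalty
print(combined_risk(9, 12, k=1, neutral=6))  # higher neutral point
```

Whatever the value of k, a group where everyone has a score equal to the neutral point gets a combined score equal to the neutral point.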

This behavior extends to more than two people. If everyone is neutral, the result is neutral.

In [28]:

mean_plus_gmean(4, 4, 4)

Out[28]:

3.9999999999999996

If you add one person with higher risk, there’s a moderate penalty, compared to the arithmetic mean.

In [29]:

mean_plus_gmean(4, 4, 5), amean([4, 4, 5])

Out[29]:

(4.6422027133971, 4.333333333333333)

With two higher risk people, the penalty is bigger.

In [30]:

mean_plus_gmean(4, 5, 5), amean([4, 5, 5])

Out[30]:

(5.308255500279445, 4.666666666666667)

And with three it is even bigger.

In [31]:

mean_plus_gmean(5, 5, 5), amean([5, 5, 5])

Out[31]:

(5.999999999999999, 5.0)

Does this make any sense?

The idea behind this suggestion is logistic regression with an interaction term. Let me explain where that comes from. OP explained:

The way we calculate individual risk for a child is

risk = likelihood + impact (R=L+I), so

L4 + I5 = R9

At first I thought it was strange to add a likelihood and an impact score. Thinking about expected value, I thought it should be the product of a probability and a cost. But if both are on a log scale, adding these logarithms is like multiplying probability by impact on a linear scale, so that makes more sense.

And if the scores are consistent with something like a log-odds scale, we can see a connection with logistic regression. If r1 and r2 are risk scores, we can imagine a regression equation that looks like this, where p is the probability of an outcome like “missing from care”:

logit(p) = a r1 + b r2 + c r1 r2

In this equation, logit(p) is the combined risk score, a, b, and c are unknown parameters, and the product r1 r2 is an interaction term that captures the tendency of high risks to amplify each other.

With enough data, we could estimate the unknown parameters. Without data, the best we can do is choose values that make the results consistent with expectations.

Since r1 and r2 are interchangeable, they have to have the same parameter. And since the whole risk scale has an unspecified zero point, we can set a and b to 1/2. Which means there is only one parameter left, the weight of the interaction term.

logit(p) = (r1 + r2) / 2 + k r1 r2

Now we can see that the first term is the arithmetic mean and the second term is close to the geometric mean, but without the square root.

So the function I suggested (the weighted sum of arithmetic and geometric means) is not identical to the logistic model, but it is motivated by it.

With this rationale in mind, we might consider a revision: rather than add the likelihood and impact scores, and then compute the weighted sum of means, it might make more sense to separate likelihood and impact, compute the weighted sum of the means of the likelihoods, and then add back the impact.

Computing by hand

In case the Python code makes it hard to see what’s going on, let’s work an example by hand. Suppose r1 is 9 and r2 is 12.

In [32]:

r1 = 9
r2 = 12

Here’s the arithmetic mean.

In [33]:

m1 = (9 + 12) / 2
m1

Out[33]:

10.5

Here’s the geometric mean.

In [34]:

m2 = np.sqrt(9 * 12)
m2

Out[34]:

10.392304845413264

And here’s how we combine them.

In [35]:

k = 1
combined_risk = m1 + k * (m2 - 4)
combined_risk

Out[35]:

16.892304845413264

Discussion

This question got my attention because OP is working on a challenging and important problem — and they provided useful context. It’s an intriguing idea to define something that is intuitively like an average, but is not always bounded between the minimum and maximum of the data.

If we think strictly about generalized means, that’s not possible. But if we think in terms of logarithms, regression, and interaction terms, we find a way.

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

Bertrand’s Boxes

May 20, 2024 AllenDowney

An early draft of Probably Overthinking It included two chapters about probability. I still think they are interesting, but the other chapters are really about data, and the examples in these chapters are more like brain teasers — so I’ve saved them for another book. Here’s an excerpt from the chapter on Bayes theorem.

In 1889 Joseph Bertrand posed and solved one of the oldest paradoxes in probability. But his solution is not quite correct – it is right for the wrong reason.

The original statement of the problem is in his Calcul des probabilités (Gauthier-Villars, 1889). As a testament to the availability of information in the 21st century, I found a scanned copy of the book online and pasted a screenshot into an online OCR server. Then I pasted the French text into an online translation service. Here is the result, which I edited lightly for clarity:

Three boxes are identical in appearance. Each has two drawers, each drawer contains a medal. The medals in the first box are gold; those in the second box, silver; the third box contains a gold medal and a silver medal.

We choose a box; what is the probability of finding, in its drawers, a gold coin and a silver coin?

Three cases are possible and they are equally likely because the three chests are identical in appearance. Only one case is favorable. The probability is 1/3.

Having chosen a box, we open a drawer. Whatever medal one finds there, only two cases are possible. The drawer that remains closed may contain a medal whose metal may or may not differ from that of the first. Of these two cases, only one is in favor of the box whose parts are different. The probability of having got hold of this set is therefore 1/2.

How can it be, however, that it will be enough to open a drawer to change the probability and raise it from 1/3 to 1/2? The reasoning cannot be correct. Indeed, it is not.

After opening the first drawer, two cases remain possible. Of these two cases, only one is favorable, this is true, but the two cases do not have the same likelihood.

If the coin we saw is gold, the other may be silver, but we would be better off betting that it is gold.

Suppose, to show the obvious, that instead of three boxes we have three hundred. One hundred contain two gold medals, one hundred contain two silver medals, and one hundred contain one gold and one silver. In each box we open a drawer, we see therefore three hundred medals. A hundred of them are in gold and a hundred in silver, that is certain; the hundred others are doubtful, they belong to boxes whose parts are not alike: chance will regulate the number.

We must expect, when opening the three hundred drawers, to see less than two hundred gold coins; the probability that the first that appears belongs to one of the hundred boxes of which the other coin is in gold is therefore greater than 1/2.

Now let me translate the paradox one more time to make the apparent contradiction clearer, and then we will resolve it.

Suppose we choose a random box, open a random drawer, and find a gold medal. What is the probability that the other drawer contains a silver medal? Bertrand offers two answers, and an argument for each:

Only one of the three boxes is mixed, so the probability that we chose it is 1/3.
When we see the gold coin, we can rule out the two-silver box. There are only two boxes left, and one of them is mixed, so the probability we chose it is 1/2.

As with so many questions in probability, we can use Bayes theorem to resolve the confusion. Initially the boxes are equally likely, so the prior probability for the mixed box is 1/3.

When we open the drawer and see a gold medal, we get some information about which box we chose. So let’s think about the likelihood of this outcome in each case:

If we chose the box with two gold medals, the likelihood of finding a gold medal is 100%.
If we chose the box with two silver medals, the likelihood is 0%.
And if we chose the box with one of each, the likelihood is 50%.

Putting these numbers into a Bayes table, here is the result:

	Prior	Likelihood	Product	Posterior
Two gold	1/3	1	1/3	2/3
Two silver	1/3	0	0	0
Mixed	1/3	1/2	1/6	1/3

The posterior probability of the mixed box is 1/3. So the first argument is correct. Initially, the probability of choosing the mixed box is 1/3 – opening a drawer and seeing a gold coin does not change it. And the Bayesian update tells us why: if there are two gold coins, rather than one, we are twice as likely to see a gold coin.
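The table above can be reproduced with a few lines of Python. This is a minimal sketch using exact fractions; the function name bayes_update is mine, for illustration.

```python
from fractions import Fraction

def bayes_update(priors, likelihoods):
    """Multiply priors by likelihoods and normalize so the results sum to 1."""
    products = [p * like for p, like in zip(priors, likelihoods)]
    total = sum(products)
    return [prod / total for prod in products]

names = ["Two gold", "Two silver", "Mixed"]
priors = [Fraction(1, 3)] * 3
likelihoods = [Fraction(1), Fraction(0), Fraction(1, 2)]

posteriors = bayes_update(priors, likelihoods)
for name, posterior in zip(names, posteriors):
    print(name, posterior)
```

The posterior probabilities are 2/3, 0, and 1/3, matching the last column of the table.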

The second argument is wrong because it fails to take into account this difference in likelihood. It’s true that there are only two boxes left, but it is not true that they are equally likely. This error is analogous to the base rate fallacy, which is the error we make if we only consider the likelihoods and ignore the prior probabilities. Here, the second argument is wrong because it commits a “likelihood fallacy” – considering only the prior probabilities and ignoring the likelihoods.

Right for the wrong reason

Bertrand’s resolution of the paradox is correct in the sense that he gets the right answer in this case. But his argument is not valid in general. He asks, “How can it be, however, that it will be enough to open a drawer to change the probability…”, implying that it is impossible in principle.

But opening the drawer does change the probabilities of the other two boxes. Having seen a gold coin, we rule out the two-silver box and increase the probability of the two-gold box. So I don’t think we can dismiss the possibility that opening the drawer could change the probability of the mixed box. It just happens, in this case, that it does not.

Let’s consider a variation of the problem where there are three drawers in each box: the first box contains three gold medals, the second contains three silver, and the third contains two gold and one silver.

In that case the likelihood of seeing a gold coin for each box is 1, 0, and 2/3, respectively. And here’s what the update looks like:

	Prior	Likelihood	Product	Posterior
Three gold	1/3	1	1/3	3/5
Three silver	1/3	0	0	0
Two gold, one silver	1/3	2/3	2/9	2/5

Now the posterior probability of the mixed box is 2/5, which is higher than the prior probability, which was 1/3. In this example, opening the drawer provides evidence that changes the probabilities of all three boxes.
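If the result seems surprising, we can check it by simulation. The following sketch plays the three-drawer game many times: choose a random box, open a random drawer, and keep only the trials where the medal is gold. The variable names and the number of trials are arbitrary choices of mine.

```python
import random

random.seed(1)

boxes = {
    "three gold": ["gold", "gold", "gold"],
    "three silver": ["silver", "silver", "silver"],
    "mixed": ["gold", "gold", "silver"],
}
box_names = list(boxes)

gold_seen = 0
mixed_given_gold = 0
for _ in range(100_000):
    name = random.choice(box_names)    # choose a box at random
    medal = random.choice(boxes[name])  # open a random drawer
    if medal == "gold":
        gold_seen += 1
        if name == "mixed":
            mixed_given_gold += 1

estimate = mixed_given_gold / gold_seen
print(estimate)
```

The fraction of gold-medal trials that came from the mixed box should be close to 2/5.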

I think there are two lessons we can learn from this example. The first is, don’t be too quick to assume that all cases are equally likely. The second is that new information can change probabilities in ways that are not obvious. The key is to think about the likelihoods.

	score	frequency	cumulative frequency	cumulative percentage
0	8	1	1	2
1	9	1	2	4
2	10	1	3	6
3	11	3	6	13
4	12	5	11	23
5	13	7	18	38
6	14	8	26	54
7	15	9	35	73
8	16	6	41	85
9	17	4	45	94
10	18	2	47	98
11	19	1	48	100
