Logarithms and Heteroskedasticity

May 26, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.


Logarithms and heteroskedasticity¶

Here’s a question from the Reddit statistics forum.

Is it correct to use logarithmic transformation in order to mitigate heteroskedasticity?

For my studies I gathered data on certain preferences across a group of people. I am trying to figure out if I can pinpoint preferences to factors such as gender in this case.

I used mixed ANOVA analysis with good success however one of my hypothesis came up with heteroskedasticity when doing Levene’s test. [I’ve been] breaking my head all day on how to solve this. I’ve now used logarithmic transformation to all 3 test results and run another Levene’s. When using the [median] value the test now results [in] homoskedasticity, however interaction is no longer significant?

Is this the correct way to deal with this problem or is there something I am missing? Thanks in advance to everyone taking their time to help.

Although the question is about ANOVA, I’m going to reframe it in terms of regression, for two reasons:

Discussion of heteroskedasticity is clearer in the context of regression.
For many problems, a regression model is better than ANOVA anyway.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

What is heteroskedasticity?¶

Linear regression is based on a model of the data-generating process where the dependent variable y is the sum of

A linear function of x with an unknown slope and intercept, and
Random values drawn from a Gaussian distribution with mean 0 and unknown standard deviation, sigma, that does not depend on x.

If, contrary to the second assumption, sigma depends on x, the data-generating process is heteroskedastic. Some amount of heteroskedasticity is common in real data, but most of the time it’s not a problem, because:

Heteroskedasticity doesn’t bias the estimated parameters of the model, only the standard errors,
Even very strong heteroskedasticity doesn’t affect the standard errors by much, and
For practical purposes we don’t need standard errors to be particularly precise.
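
As a quick check of the first point, here's a minimal simulation sketch. It assumes a data-generating process like the one used below (true slope 1, sigma growing from 0.1 to 6.0 across the range of x); the function name `simulate_slope` and the number of replications are illustrative choices, not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(17)

def simulate_slope(n=200):
    """Generate one heteroskedastic dataset and return the fitted OLS slope."""
    xs = rng.normal(30, 1, size=n)
    # sigma increases linearly from 0.1 to 6.0 across the observed range of xs
    sigmas = np.interp(xs, [xs.min(), xs.max()], [0.1, 6.0])
    ys = xs + rng.normal(0, sigmas)  # true intercept 0, true slope 1
    slope, intercept = np.polyfit(xs, ys, 1)
    return slope

slopes = np.array([simulate_slope() for _ in range(1000)])
mean_slope = slopes.mean()  # should be close to the true slope if unbiased
sd_slope = slopes.std()     # spread of the estimates across datasets
print(mean_slope, sd_slope)
```

If the estimates are unbiased, the mean of the fitted slopes should land close to 1 even though individual estimates vary quite a bit from one dataset to the next.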

To demonstrate, let’s generate some data. First, we’ll draw xs from a normal distribution.

In [3]:

np.random.seed(17)

n = 200
xs = np.random.normal(30, 1, size=n)
xs.sort()

To generate heteroskedastic data, I’ll use interpolate to construct a function where sigma depends on x.

In [4]:

from scipy.interpolate import interp1d

def interpolate(xs, sigma_seq):
    return interp1d([xs.min(), xs.max()], sigma_seq)(xs)

To generate strong heteroskedasticity, I’ll vary sigma over a wide range.

In [5]:

sigmas = interpolate(xs, [0.1, 6.0])
np.mean(sigmas)

Out[5]:

3.126391153924031

Here’s what sigma looks like as a function of x.

In [6]:

plt.plot(xs, sigmas, '.')

decorate(xlabel='x',
         ylabel='sigma')


Now we can generate ys with variable values of sigma.

In [7]:

ys = xs + np.random.normal(0, sigmas)

If we make a scatter plot of the data, we see a cone shape that indicates heteroskedasticity.

In [8]:

plt.plot(xs, ys, '.')

decorate(xlabel='x', ylabel='y')

Now let’s fit a model to the data.

In [9]:

import statsmodels.api as sm

X = sm.add_constant(xs)
ols_model = sm.OLS(ys, X)
ols_results = ols_model.fit()

intercept, slope = ols_results.params
intercept, slope

Out[9]:

(0.7580177696902339, 0.9672433174107101)

Here’s what the fitted line looks like.

In [10]:

fys = intercept + slope * xs

plt.plot(xs, ys, '.')
plt.plot(xs, fys)

decorate(xlabel='x', ylabel='y')

If we plot the absolute values of the residuals, we can see the heteroskedasticity more clearly.

In [11]:

resid = ys - fys
plt.plot(xs, np.abs(resid), '.')

decorate(xlabel='x', ylabel='absolute residual')

Testing for heteroskedasticity¶

OP mentions using the Levene test for heteroskedasticity, which is used to test whether sigma is different between groups. For continuous values of x and y, we can use the Breusch-Pagan Lagrange Multiplier test:

In [12]:

from statsmodels.stats.diagnostic import het_breuschpagan

_, p_value, _, _ = het_breuschpagan(resid, ols_model.exog)
p_value

Out[12]:

0.0006411829020109725

Or White’s Lagrange Multiplier test:

In [13]:

from statsmodels.stats.diagnostic import het_white

_, p_value, _, _ = het_white(resid, ols_model.exog)
p_value

Out[13]:

0.0019263142806157931

Both tests produce small p-values, which means that if we generate a dataset by a homoskedastic process, there is almost no chance it would have as much heteroskedasticity as the dataset we generated.

If you have never heard of either of these tests, don’t panic: neither had I until I looked them up for this example. And don’t worry about remembering them, because you should never use them again. Like testing for normality, testing for heteroskedasticity is never useful.

Why? Because in almost any real dataset, you will find some heteroskedasticity. So if you test for it, there are only two possible results:

If the heteroskedasticity is small and you don’t have much data, you will fail to reject the null hypothesis.
If the heteroskedasticity is large or you have a lot of data, you will reject the null hypothesis.

Either way, you learn nothing — and in particular, you don’t learn the answer to the question you actually care about, which is whether the heteroskedasticity is so large that the effect on the standard errors is large enough that you should care.

And the answer to that question is almost always no.
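
To illustrate the two outcomes listed above, here's a rough sketch that applies the same mild heteroskedasticity to a small sample and a large one. It computes a studentized Breusch-Pagan-style statistic by hand (n times the R² from regressing squared residuals on x) rather than calling statsmodels; the sigma function, sample sizes, and the name `bp_p_value` are assumptions chosen for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(17)

def bp_p_value(n):
    """Simulate n points with mild heteroskedasticity; return a BP-style p-value."""
    xs = rng.normal(30, 1, size=n)
    # sigma changes only slightly with x (hypothetical choice)
    sigmas = np.clip(1.25 + 0.06 * (xs - 30), 0.1, None)
    ys = xs + rng.normal(0, sigmas)

    # residuals from an ordinary least squares line
    slope, intercept = np.polyfit(xs, ys, 1)
    resid = ys - (intercept + slope * xs)

    # auxiliary regression of squared residuals on x: LM = n * R^2,
    # and with one predictor R^2 is the squared correlation
    r = np.corrcoef(xs, resid**2)[0, 1]
    lm = n * r**2

    # upper tail of a chi-squared distribution with 1 degree of freedom
    return math.erfc(math.sqrt(lm / 2))

p_small = bp_p_value(50)
p_large = bp_p_value(50_000)
print(p_small, p_large)
```

The underlying process is the same in both calls; only the sample size differs. With the large sample the test rejects decisively, which says more about the amount of data than about whether the heteroskedasticity matters.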

Should we care?¶

The dataset we generated has very large heteroskedasticity. Let’s see how much effect that has on the results. Here are the standard errors from simple linear regression:

In [14]:

ols_results.bse

Out[14]:

array([6.06159018, 0.20152957])

Now, there are several ways to generate standard errors that are robust in the presence of heteroskedasticity. One is the Huber-White estimator, which we can compute like this:

In [15]:

robust_se = ols_results.get_robustcov_results(cov_type='HC3')
robust_se.bse

Out[15]:

array([6.73913518, 0.2268012 ])

Another is to use Huber regression.

In [16]:

huber_model = sm.RLM(ys, X, M=sm.robust.norms.HuberT())
huber_results = huber_model.fit()
huber_results.bse

Out[16]:

array([5.92709031, 0.19705786])

Another is to use quantile regression.

In [17]:

quantile_model = sm.QuantReg(ys, X)
quantile_results = quantile_model.fit(q=0.5)
quantile_results.bse

Out[17]:

array([7.6449323 , 0.25417092])

And one more option is a wild bootstrap, which resamples the residuals by multiplying them by a random sequence of 1 and -1. This way of resampling preserves heteroskedasticity, because it only changes the sign of the residuals, not the magnitude, and it maintains the relationship between those magnitudes and x.

In [18]:

from scipy.stats import linregress

def wild_bootstrap():
    resampled = fys + ols_results.resid * np.random.choice([1, -1], size=n)
    res = linregress(xs, resampled)
    return res.intercept, res.slope

We can use wild_bootstrap to generate a sample from the sampling distributions of the intercept and slope.

In [19]:

sample = [wild_bootstrap() for i in range(1001)]

The standard deviation of the sampling distributions is the standard error.

In [20]:

bootstrap_bse = np.std(sample, axis=0)
bootstrap_bse

Out[20]:

array([6.63622313, 0.22341784])

Now let’s put all of the results in a table.

In [21]:

columns = ['SE(intercept)', 'SE(slope)']
index = ['OLS', 'Huber-White', 'Huber', 'quantile', 'bootstrap']
data = [ols_results.bse, robust_se.bse, huber_results.bse, 
        quantile_results.bse, bootstrap_bse]
df = pd.DataFrame(data, columns=columns, index=index)
df.sort_values(by='SE(slope)')

Out[21]:

	SE(intercept)	SE(slope)
Huber	5.927090	0.197058
OLS	6.061590	0.201530
bootstrap	6.636223	0.223418
Huber-White	6.739135	0.226801
quantile	7.644932	0.254171

The standard errors we get from different methods are notably different, but the differences probably don’t matter.

First, I am skeptical of the results from Huber regression. With this kind of heteroskedasticity, the standard errors should be larger than what we get from OLS. I’m not sure what the problem is, and I haven’t bothered to find out, because I don’t think Huber regression is necessary in the first place.

The results from bootstrapping and the Huber-White estimator are the most reliable — which suggests that the standard errors from quantile regression are too big.

In my opinion, we don’t need esoteric methods to deal with heteroskedasticity. If heteroskedasticity is extreme, consider using wild bootstrap. Otherwise, just use ordinary least squares.

Now let’s address OP’s headline question, “Is it correct to use logarithmic transformation in order to mitigate heteroskedasticity?”

Does a log transform help?¶

In some cases, a log transform can reduce or eliminate heteroskedasticity. However, there are several reasons this is not a good idea in general:

As we’ve seen, heteroskedasticity is not a big problem, so it usually doesn’t require any mitigation.
Taking a log transform of one or more variables in a regression model changes the meaning of the model — it hypothesizes a relationship between the variables that might not make sense in context.
Anyway, taking a log transform doesn’t always help.

To demonstrate the last point, let’s see what happens if we apply a log transform to the dependent variable:

In [22]:

log_ys = np.log10(ys)

Here’s what the scatter plot looks like after the transform.

In [23]:

plt.plot(xs, log_ys, '.')

decorate(xlabel='x', ylabel='log10 y')

Here’s what we get if we fit a model to the data.

In [24]:

ols_model_log = sm.OLS(log_ys, X)
ols_results_log = ols_model_log.fit()

intercept, slope = ols_results_log.params
intercept, slope

Out[24]:

(1.072059757295944, 0.013315487289434717)
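
To see how the transform changes the meaning of the model, here's a small sketch that converts the slope reported above back to the original scale. The only input is that slope value; the variable names are illustrative.

```python
slope = 0.013315487289434717  # slope reported above for log10(y) versus x

# On the log10 scale, each unit of x adds `slope` to log10(y),
# which is the same as multiplying y by 10**slope on the original scale.
factor = 10 ** slope
percent_change = (factor - 1) * 100
print(factor, percent_change)
```

So the log model says each unit of x multiplies y by a factor of about 1.03 (roughly a 3% increase), whereas the original model says each unit of x adds a fixed amount to y. Whether a multiplicative relationship makes sense depends on the context.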

And here’s the fitted line.

In [25]:

log_fys = intercept + slope * xs

plt.plot(xs, log_ys, '.')
plt.plot(xs, log_fys)

decorate(xlabel='x', ylabel='log10 y')

If we plot the absolute values of the residuals, we can see that the log transform did not entirely remove the heteroskedasticity.

In [26]:

log_resid = log_ys - log_fys
plt.plot(xs, np.abs(log_resid), '.')

decorate(xlabel='x', ylabel='absolute residual on log y')

We can confirm this by running the tests again (which we should never do).

In [27]:

_, p_value, _, _ = het_breuschpagan(log_resid, ols_model_log.exog)
p_value

Out[27]:

0.002154782205265498

In [28]:

_, p_value, _, _ = het_white(log_resid, ols_model_log.exog)
p_value

Out[28]:

0.006069762292696221

The p-values are bigger, which suggests that the log transform mitigated the heteroskedasticity a little. But if the goal was to eliminate heteroskedasticity, the log transform didn’t do it.

Discussion¶

To summarize:

Heteroskedasticity is common in real datasets — if you test for it, you will often find it, provided you have enough data.
Either way, testing does not answer the question you really care about, which is whether the heteroskedasticity is extreme enough to be a problem.
Plain old linear regression is robust to heteroskedasticity, so unless it is really extreme, it is probably not a problem.
Even in the worst case, heteroskedasticity does not bias the estimated parameters — it only affects the standard errors — and we don’t need standard errors to be particularly precise anyway.
Although a log transform can sometimes mitigate heteroskedasticity, it doesn’t always succeed, and even if it does, it’s usually not necessary.
A log transform changes the meaning of the regression model in ways that might not make sense in context.

So, use a log transform if it makes sense in context, not to mitigate a problem that’s not much of a problem in the first place.

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International


Combining Risks

May 24, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.


Combining Risks¶

Here’s a question from the Reddit statistics forum.

Bit of a weird one but I’m hoping you’re the community to help. I work in children’s residential care and I’m trying to find a way of better matching potential young people together.

The way we calculate individual risk for a child is

risk = likelihood + impact (R=L+I), so

L4 + I5 = R9

That works well for individuals but I need to work out a good way of calculating a combined risk to place children [in] the home together. I’m currently using the [arithmetic] average but I don’t feel that it works properly as the average is always lower then the highest risk.

I’ll use a fairly light risk as an example, running away from the home. (We call this MFC missing from care) It’s fairly common that one of the kids will run away from the home at some point or another either out of boredom or frustration. If young person A has a risk of 9 and young person B has a risk of 12 the the average risk of MFC in the home would be 10.5

HOWEVER more often then not having two young people that go MFC will often result in more episodes as they will run off together, so having a lower risk rating doesn’t really make sense. Adding the two together to 21 doesn’t really work either though as the likelihood is the thing that increases not necessarily the impact.

I’m a lot better at chasing after run away kids then I am mathematics so please help 😂.

Here’s one way to think about this question: based on background knowledge and experience, OP has qualitative ideas about what happens when we put children at different risks together, and he is looking for a statistical summary that is consistent with these ideas.

The arithmetic mean probably makes sense as a starting point, but it clashes with the observation that if you put two children together who are high risk, they interact in ways that increase the risk. For example, if we put together children with risks 9 and 12, the average is 10.5, and OP’s domain knowledge says that’s too low — it should be more than 12.

At the other extreme, I’ll guess that putting together two low risk children might be beneficial to both — so the combined risk might be lower than either.

And that implies that there is a neutral point somewhere in the middle, where the combined risk is equal to the individual risks.

To construct a summary statistic like that, I suggest a weighted sum of the arithmetic and geometric means. That might sound strange, but I’ll show that it has the properties we want. And it might not be as strange as it sounds — there’s a reason it might make sense.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [18]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [19]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Weighted sum of means¶

The following function computes the arithmetic mean of a sequence of values, which is the sum divided by n.

In [20]:

def amean(xs):
    n = len(xs)
    return np.sum(xs) / n

The following function computes the geometric mean of a sequence, which is the product raised to the power 1/n.

In [21]:

def gmean(xs):
    n = len(xs)
    return np.prod(xs) ** (1/n)
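
As an aside, the product can overflow or underflow for long sequences. Here's an alternative sketch that computes the same quantity as the exponential of the mean of the logs, assuming all values are positive; the name `gmean_log` is illustrative.

```python
import numpy as np

def gmean_log(xs):
    """Geometric mean computed on the log scale (requires positive values)."""
    xs = np.asarray(xs, dtype=float)
    return np.exp(np.mean(np.log(xs)))

print(gmean_log([9, 12]))
```

For short lists of small risk scores the two versions should agree to within floating-point error, so the simpler product form is fine here. SciPy also provides `scipy.stats.gmean` for the same purpose.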

And the following function computes the weighted sum of the arithmetic and geometric means. The constant k determines how much weight we give the geometric mean.

In [22]:

def mean_plus_gmean(*xs, k=1):
    return amean(xs) + k * (gmean(xs) - 4)

The value 4 determines the neutral point. So if we put together two people with risk 4, the combined average is 4.

In [23]:

mean_plus_gmean(4, 4)

Out[23]:

4.0

Above the neutral point, there is a penalty if we put together two children with higher risks.

In [24]:

mean_plus_gmean(5, 5)

Out[24]:

6.0

In that case, the combined risk is higher than the individual risks. Below the neutral point, there is a bonus if we put together two children with low risks.

In [25]:

mean_plus_gmean(3, 3)

Out[25]:

2.0

In that case, the combined risk is less than the individual risks.

If we combine low and high risks, the discrepancy brings the average down a little.

In [26]:

mean_plus_gmean(3, 5)

Out[26]:

3.872983346207417

In the example OP presented, where we put together two people with high risk, the penalty is substantial.

In [27]:

mean_plus_gmean(9, 12)

Out[27]:

16.892304845413264

If that penalty seems too high, we can adjust the weight, k, and the neutral point accordingly.
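
Here's one way that adjustment might look: a sketch of a variant with the neutral point exposed as a parameter. The name `combined_risk` and the parameter `neutral` are hypothetical, and the values of k and neutral below are arbitrary illustrations, not recommendations.

```python
import math

def combined_risk(*xs, k=1, neutral=4):
    """Arithmetic mean plus k times (geometric mean minus the neutral point)."""
    n = len(xs)
    amean = sum(xs) / n
    gmean = math.prod(xs) ** (1 / n)
    return amean + k * (gmean - neutral)

# Halving the weight shrinks the penalty for risks 9 and 12 (about 13.7)
halved = combined_risk(9, 12, k=0.5)

# Raising the neutral point to 6 shrinks it further (about 12.7)
shifted = combined_risk(9, 12, k=0.5, neutral=6)

print(halved, shifted)
```

With the defaults, this reproduces `mean_plus_gmean`. In practice, k and the neutral point would be tuned until the results match what domain experts expect for a handful of familiar cases.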

This behavior extends to more than two people. If everyone is neutral, the result is neutral.

In [28]:

mean_plus_gmean(4, 4, 4)

Out[28]:

3.9999999999999996

If you add one person with higher risk, there’s a moderate penalty, compared to the arithmetic mean.

In [29]:

mean_plus_gmean(4, 4, 5), amean([4, 4, 5])

Out[29]:

(4.6422027133971, 4.333333333333333)

With two higher risk people, the penalty is bigger.

In [30]:

mean_plus_gmean(4, 5, 5), amean([4, 5, 5])

Out[30]:

(5.308255500279445, 4.666666666666667)

And with three it is even bigger.

In [31]:

mean_plus_gmean(5, 5, 5), amean([5, 5, 5])

Out[31]:

(5.999999999999999, 5.0)

Does this make any sense?¶

The idea behind this suggestion is logistic regression with an interaction term. Let me explain where that comes from. OP explained:

The way we calculate individual risk for a child is

risk = likelihood + impact (R=L+I), so

L4 + I5 = R9

At first I thought it was strange to add a likelihood and an impact score. Thinking about expected value, I thought it should be the product of a probability and a cost. But if both are on a log scale, adding these logarithms is like multiplying probability by impact on a linear scale, so that makes more sense.

And if the scores are consistent with something like a log-odds scale, we can see a connection with logistic regression. If r1 and r2 are risk scores, we can imagine a regression equation that looks like this, where p is the probability of an outcome like “missing from care”:

logit(p) = a r1 + b r2 + c r1 r2

In this equation, logit(p) is the combined risk score, a, b, and c are unknown parameters, and the product r1 r2 is an interaction term that captures the tendency of high risks to amplify each other.

With enough data, we could estimate the unknown parameters. Without data, the best we can do is choose values that make the results consistent with expectations.

Since r1 and r2 are interchangeable, they have to have the same parameter. And since the whole risk scale has an unspecified zero point, we can set a and b to 1/2. Which means there is only one parameter left, the weight of the interaction term.

logit(p) = (r1 + r2) / 2 + k r1 r2

Now we can see that the first term is the arithmetic mean and the second term is close to the geometric mean, but without the square root.

So the function I suggested, the weighted sum of arithmetic and geometric means, is not identical to the logistic model, but it is motivated by it.

With this rationale in mind, we might consider a revision: rather than add the likelihood and impact scores, and then compute the weighted sum of means, it might make more sense to separate likelihood and impact, compute the weighted sum of the means of the likelihoods, and then add back the impact.
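
Here's a rough sketch of what that revision might look like. Several details are assumptions, since the text above doesn't pin them down: the impact that gets added back is taken to be the mean impact, the neutral point of 4 is applied to the likelihood scale, and the second child's scores (likelihood 6, impact 6) are made up for illustration.

```python
import math

def combine_likelihoods(*ls, k=1, neutral=4):
    """Weighted sum of arithmetic and geometric means of the likelihood scores."""
    n = len(ls)
    amean = sum(ls) / n
    gmean = math.prod(ls) ** (1 / n)
    return amean + k * (gmean - neutral)

def combined_risk_split(children, k=1, neutral=4):
    """children: sequence of (likelihood, impact) pairs.

    Combines the likelihoods with the interaction-style penalty,
    then adds back the mean impact (an assumption).
    """
    likelihoods = [likelihood for likelihood, _ in children]
    impacts = [impact for _, impact in children]
    mean_impact = sum(impacts) / len(impacts)
    return combine_likelihoods(*likelihoods, k=k, neutral=neutral) + mean_impact

# OP's example child (L4, I5) with a hypothetical second child (L6, I6)
example = combined_risk_split([(4, 5), (6, 6)])
print(example)
```

Under this version, only the likelihood is amplified when higher-risk children are placed together, which matches OP's observation that "the likelihood is the thing that increases not necessarily the impact."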

Computing by hand¶

In case the Python code makes it hard to see what’s going on, let’s work an example by hand. Suppose r1 is 9 and r2 is 12.

In [32]:

r1 = 9
r2 = 12

Here’s the arithmetic mean.

In [33]:

m1 = (9 + 12) / 2
m1

Out[33]:

10.5

Here’s the geometric mean.

In [34]:

m2 = np.sqrt(9 * 12)
m2

Out[34]:

10.392304845413264

And here’s how we combine them.

In [35]:

k = 1
combined_risk = m1 + k * (m2 - 4)
combined_risk

Out[35]:

16.892304845413264

Discussion¶

This question got my attention because OP is working on a challenging and important problem — and they provided useful context. It’s an intriguing idea to define something that is intuitively like an average, but is not always bounded between the minimum and maximum of the data.

If we think strictly about generalized means, that’s not possible. But if we think in terms of logarithms, regression, and interaction terms, we find a way.

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

Bertrand’s Boxes

May 20, 2024 AllenDowney

An early draft of Probably Overthinking It included two chapters about probability. I still think they are interesting, but the other chapters are really about data, and the examples in these chapters are more like brain teasers — so I’ve saved them for another book. Here’s an excerpt from the chapter on Bayes theorem.

In 1889 Joseph Bertrand posed and solved one of the oldest paradoxes in probability. But his solution is not quite correct – it is right for the wrong reason.

The original statement of the problem is in his Calcul des probabilités (Gauthier-Villars, 1889). As a testament to the availability of information in the 21st century, I found a scanned copy of the book online and pasted a screenshot into an online OCR server. Then I pasted the French text into an online translation service. Here is the result, which I edited lightly for clarity:

Three boxes are identical in appearance. Each has two drawers, each drawer contains a medal. The medals in the first box are gold; those in the second box, silver; the third box contains a gold medal and a silver medal.

We choose a box; what is the probability of finding, in its drawers, a gold coin and a silver coin?

Three cases are possible and they are equally likely because the three chests are identical in appearance. Only one case is favorable. The probability is 1/3.

Having chosen a box, we open a drawer. Whatever medal one finds there, only two cases are possible. The drawer that remains closed may contain a medal whose metal may or may not differ from that of the first. Of these two cases, only one is in favor of the box whose parts are different. The probability of having got hold of this set is therefore 1/2.

How can it be, however, that it will be enough to open a drawer to change the probability and raise it from 1/3 to 1/2? The reasoning cannot be correct. Indeed, it is not.

After opening the first drawer, two cases remain possible. Of these two cases, only one is favorable, this is true, but the two cases do not have the same likelihood.

If the coin we saw is gold, the other may be silver, but we would be better off betting that it is gold.

Suppose, to show the obvious, that instead of three boxes we have three hundred. One hundred contain two gold medals, one hundred contain two silver medals, and one hundred contain one gold and one silver. In each box we open a drawer; we therefore see three hundred medals. A hundred of them are gold and a hundred are silver, that is certain; the hundred others are doubtful, they belong to boxes whose parts are not alike: chance will regulate the number.

We must expect, when opening the three hundred drawers, to see fewer than two hundred gold coins; the probability that the first that appears belongs to one of the hundred boxes of which the other coin is gold is therefore greater than 1/2.

Now let me translate the paradox one more time to make the apparent contradiction clearer, and then we will resolve it.

Suppose we choose a random box, open a random drawer, and find a gold medal. What is the probability that the other drawer contains a silver medal? Bertrand offers two answers, and an argument for each:

Only one of the three boxes is mixed, so the probability that we chose it is 1/3.
When we see the gold coin, we can rule out the two-silver box. There are only two boxes left, and one of them is mixed, so the probability we chose it is 1/2.

As with so many questions in probability, we can use Bayes theorem to resolve the confusion. Initially the boxes are equally likely, so the prior probability for the mixed box is 1/3.

When we open the drawer and see a gold medal, we get some information about which box we chose. So let’s think about the likelihood of this outcome in each case:

If we chose the box with two gold medals, the likelihood of finding a gold medal is 100%.
If we chose the box with two silver medals, the likelihood is 0%.
And if we chose the box with one of each, the likelihood is 50%.

Putting these numbers into a Bayes table, here is the result:

	Prior	Likelihood	Product	Posterior
Two gold	1/3	1	1/3	2/3
Two silver	1/3	0	0	0
Mixed	1/3	1/2	1/6	1/3

The posterior probability of the mixed box is 1/3. So the first argument is correct. Initially, the probability of choosing the mixed box is 1/3 – opening a drawer and seeing a gold coin does not change it. And the Bayesian update tells us why: if there are two gold coins, rather than one, we are twice as likely to see a gold coin.

The second argument is wrong because it fails to take into account this difference in likelihood. It’s true that there are only two boxes left, but it is not true that they are equally likely. This error is analogous to the base rate fallacy, which is the error we make if we only consider the likelihoods and ignore the prior probabilities. Here, the second argument is wrong because it commits a “likelihood fallacy” – considering only the prior probabilities and ignoring the likelihoods.
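
The Bayes table above can be reproduced with a few lines of Python. This is a minimal sketch using exact fractions; the helper name `bayes_update` is illustrative.

```python
from fractions import Fraction

def bayes_update(priors, likelihoods):
    """Multiply priors by likelihoods and normalize to get posteriors."""
    products = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(products)
    return [x / total for x in products]

# Order: two gold, two silver, mixed
priors = [Fraction(1, 3)] * 3
likelihoods = [Fraction(1), Fraction(0), Fraction(1, 2)]

posteriors = bayes_update(priors, likelihoods)
print(posteriors)
```

The normalizing constant is the total probability of seeing a gold medal, 1/2, so dividing each product by it gives the posterior column of the table.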

Right for the wrong reason

Bertrand’s resolution of the paradox is correct in the sense that he gets the right answer in this case. But his argument is not valid in general. He asks, “How can it be, however, that it will be enough to open a drawer to change the probability…”, implying that it is impossible in principle.

But opening the drawer does change the probabilities of the other two boxes. Having seen a gold coin, we rule out the two-silver box and increase the probability of the two-gold box. So I don’t think we can dismiss the possibility that opening the drawer could change the probability of the mixed box. It just happens, in this case, that it does not.

Let’s consider a variation of the problem where there are three drawers in each box: the first box contains three gold medals, the second contains three silver, and the third contains two gold and one silver.

In that case the likelihood of seeing a gold coin in each case is 1, 0, and 2/3, respectively. And here’s what the update looks like:

	Prior	Likelihood	Product	Posterior
Three gold	1/3	1	1/3	3/5
Three silver	1/3	0	0	0
Two gold, one silver	1/3	2/3	2/9	2/5

Now the posterior probability of the mixed box is 2/5, which is higher than the prior probability, which was 1/3. In this example, opening the drawer provides evidence that changes the probabilities of all three boxes.
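
If the table isn't convincing, a simulation can serve as a sanity check. Here's a sketch that repeatedly chooses a random box and a random drawer in the three-drawer variation, then estimates the fraction of gold sightings that came from the mixed box. The seed and number of trials are arbitrary.

```python
import random

random.seed(17)

boxes = [
    ["gold", "gold", "gold"],
    ["silver", "silver", "silver"],
    ["gold", "gold", "silver"],  # the mixed box
]

saw_gold = 0
mixed_and_gold = 0
for _ in range(100_000):
    box_index = random.randrange(3)          # choose a box at random
    medal = random.choice(boxes[box_index])  # open a random drawer
    if medal == "gold":
        saw_gold += 1
        if box_index == 2:
            mixed_and_gold += 1

# Estimated probability of the mixed box, given that we saw gold
estimate = mixed_and_gold / saw_gold
print(estimate)
```

The estimate should come out near 2/5, in line with the posterior in the table.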

I think there are two lessons we can learn from this example. The first is, don’t be too quick to assume that all cases are equally likely. The second is that new information can change probabilities in ways that are not obvious. The key is to think about the likelihoods.

Estimation with Small Samples

May 14, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.


Estimation with Small Samples¶

Here’s a question from the Reddit statistics forum.

Hey, so imagine I only have 6 samples from a value that has a normal distribution. Can I estimate the range of likely distributions from those 6?

Let’s be more specific. I’m considering the accuracy of a blood testing device. I took 6 samples of my blood at the same time from the same vein and gave them to the machine. The results are not all the same (as expected), indicating the device’s inherent level of imprecision.

So, I’m wondering if there’s a way to estimate the range of possibilities of what I would see if I could give 100 or 1000 samples?

I’m comfortable assuming a normal distribution around the “true” value. Is there any stats method to guesstimate the range of likely values for sigma? Or would I just need to drain my blood dry to get 1000 samples to figure that out?

Fyi, not a statistician.

Because the sample size is so small, this question cries out for a Bayesian approach. Why?

Bayesian methods do a good job of taking advantage of background information and extracting as much information as possible from the data,
With a small sample size, the uncertainty of the result is large, so it is important to quantify it, and
The motivating question is explicitly about making probabilistic predictions, which is what Bayesian methods do and classical methods don’t.

As an example, OP suggested looking at blood potassium (K+) levels, with the following data:

4.0, 3.9, 4.0, 4.0, 4.7, 4.2 (mmol/L)

OP also explained that the blood samples were collected “continuously from a single puncture … within seconds, not minutes,” so we don’t expect the true level to change much between samples. In that case, the variation in the measurements would largely reflect the precision of the testing device.

With that assumption, let’s see what we can do.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Bayesian estimation¶

Here’s the data.

In [3]:

data = [4.0, 3.9, 4.0, 4.0, 4.7, 4.2]
np.mean(data), np.std(data)

Out[3]:

(4.133333333333334, 0.2687419249432851)

Now we need prior distributions for mu and sigma. For the prior distribution of mu we can use the distribution in the general population. The “normal range” for potassium — which is usually the 5th and 95th percentiles of the population — is 3.5 to 5.4 mmol/L. So we can construct a normal distribution with these percentiles.

In [4]:

from scipy.stats import norm

hyper_mu = (3.5 + 5.4) / 2
hyper_sigma = 0.6

dist = norm(hyper_mu, hyper_sigma)
dist.ppf([0.05, 0.95])

Out[4]:

array([3.46308782, 5.43691218])
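
The value 0.6 is a rounded choice. Here's a sketch of how one might solve for the standard deviation directly, assuming 3.5 and 5.4 are exactly the 5th and 95th percentiles of a normal distribution. It uses the standard library's `NormalDist` rather than SciPy; the variable names are illustrative.

```python
from statistics import NormalDist

low, high = 3.5, 5.4

# z-score of the 95th percentile of a standard normal distribution
z95 = NormalDist().inv_cdf(0.95)

# The 5th and 95th percentiles sit z95 standard deviations on either side of the mean
mu = (low + high) / 2
sigma = (high - low) / (2 * z95)
print(mu, sigma)

# Check: the fitted distribution should reproduce the percentiles
fitted = NormalDist(mu, sigma)
print(fitted.inv_cdf(0.05), fitted.inv_cdf(0.95))
```

This gives a standard deviation a little under 0.6, so the rounded value used above produces a slightly wider prior, which is a reasonable choice when the prior is only meant as rough background information.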

Now I’ll make a Pmf object that represents this distribution.

In [5]:

mus = np.linspace(2, 7, 51)
ps = dist.pdf(mus)

In [6]:

from empiricaldist import Pmf

prior_mu = Pmf(ps, mus)
prior_mu.index.name = 'mu'
prior_mu.normalize()

Out[6]:

9.99977690541445

In [7]:

prior_mu.plot(label='prior')
decorate(ylabel='PDF')

For the prior distribution of sigma, we could use some background information about the device. For example, based on similar devices, what values of sigma would be expected? I’ll use a gamma distribution to construct a prior PMF, but it could be anything.

In [8]:

from scipy.stats import gamma

alpha = 2
beta = 0.5

sigmas = np.linspace(0.01, 5, 101)
ps = gamma(alpha, scale=beta).pdf(sigmas)

In [9]:

prior_sigma = Pmf(ps, sigmas)
prior_sigma.index.name = 'sigma'
prior_sigma.normalize()

Out[9]:

20.03019853279729

In [10]:

prior_sigma.plot(label='prior')
decorate(ylabel='PDF')

This distribution represents the background knowledge that we expect the standard deviation to be less than 2, and we’d be surprised by anything much higher than that.
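As a quick check on that interpretation, here is a sketch that computes the mean of the gamma distribution and the probability it assigns to values below 2. It uses the continuous distribution rather than the discretized PMF, so the numbers are only approximately those of prior_sigma. The names dist_sigma, prior_mean, and p_below_2 are mine.

```python
from scipy.stats import gamma

# Same parameters as the prior for sigma above:
# shape alpha = 2 and scale beta = 0.5.
dist_sigma = gamma(2, scale=0.5)

# Mean of the distribution (shape times scale).
prior_mean = dist_sigma.mean()

# Probability that sigma is less than 2 under this prior.
p_below_2 = dist_sigma.cdf(2)

print(prior_mean, p_below_2)
```

Under this prior, roughly nine-tenths of the probability is below 2.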

I’ll use this function to make a Pandas DataFrame to represent the joint prior.

In [11]:

def make_joint(s1, s2):
    """Compute the outer product of two Series.

    First Series goes down the rows;
    second goes across the columns.

    s1: Series
    s2: Series

    return: DataFrame
    """
    X, Y = np.meshgrid(s1, s2, indexing='ij')
    return pd.DataFrame(X*Y, index=s1.index, columns=s2.index)

In [12]:

prior = make_joint(prior_mu, prior_sigma)
prior.shape

Out[12]:

(51, 101)

I’ll use the following function to make a contour plot of the prior.

In [13]:

def plot_contour(joint):
    """Plot a joint distribution.

    joint: DataFrame representing a joint PMF
    """
    low = joint.to_numpy().min()
    high = joint.to_numpy().max()
    levels = np.linspace(low, high, 6)
    levels = levels[1:]

    cs = plt.contour(joint.columns, joint.index, joint, levels=levels, linewidths=1)
    decorate(xlabel=joint.columns.name,
             ylabel=joint.index.name)
    return cs

In [14]:

plot_contour(prior)
decorate()

The update¶

To use the data to update the prior, we have to compute the likelihood of the data for each possible pair of mu and sigma. We can do that by creating a 3-D mesh with the possible values of mu and sigma, and the observed values of the data.

In [15]:

MU, SIGMA, DATA = np.meshgrid(prior_mu.index, prior_sigma.index, data,
                              indexing='ij')
MU.shape

Out[15]:

(51, 101, 6)

Now we can evaluate the normal distribution for each data point and each pair of mu and sigma.

In [16]:

densities = norm(MU, SIGMA).pdf(DATA)
densities.shape

Out[16]:

(51, 101, 6)

The likelihood of each pair is the product of the densities for the data points.

In [17]:

likelihood = densities.prod(axis=2)
likelihood.shape

Out[17]:

(51, 101)
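With six data points, the product of densities is fine. With many more, it could underflow to zero. One common alternative, which the notebook does not use, is to add up log densities instead. Here is a minimal sketch on a smaller, illustrative grid. The grid bounds and the name likelihood2 are my assumptions.

```python
import numpy as np
from scipy.stats import norm

data = np.array([4.0, 3.9, 4.0, 4.0, 4.7, 4.2])

# A small illustrative grid (coarser than the one in the notebook).
mus = np.linspace(3, 5, 21)
sigmas = np.linspace(0.1, 1, 10)

MU, SIGMA, DATA = np.meshgrid(mus, sigmas, data, indexing='ij')

# Product of densities, as in the text.
likelihood = norm(MU, SIGMA).pdf(DATA).prod(axis=2)

# Sum of log densities, shifted so the largest value is 0 before
# exponentiating; the shift only changes the result by a constant factor.
log_like = norm(MU, SIGMA).logpdf(DATA).sum(axis=2)
likelihood2 = np.exp(log_like - log_like.max())

# The two agree up to a constant factor, which normalization removes.
print(np.allclose(likelihood / likelihood.max(), likelihood2))
```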

The unnormalized posterior is the product of the prior and the likelihood.

In [18]:

posterior = prior * likelihood

We can normalize it like this.

In [19]:

def normalize(joint):
    """Normalize a joint distribution.

    joint: DataFrame
    """
    prob_data = joint.to_numpy().sum()
    joint /= prob_data
    return prob_data

In [20]:

normalize(posterior)
posterior.shape

Out[20]:

(51, 101)

And here’s what the result looks like.

In [21]:

plot_contour(posterior)
decorate()

Here’s the posterior distribution of sigma compared to the prior.

In [22]:

def marginal(joint, axis):
    """Compute a marginal distribution.

    axis=0 sums over the rows and returns the marginal distribution of the column variable
    axis=1 sums over the columns and returns the marginal distribution of the row variable

    joint: DataFrame representing a joint distribution
    axis: int axis to sum along

    returns: Pmf
    """
    return Pmf(joint.sum(axis=axis))

In [23]:

posterior_sigma = marginal(posterior, 0)

In [24]:

prior_sigma.plot(color='gray', label='prior')
posterior_sigma.plot(label='posterior')

decorate(xlabel='sigma', ylabel='PDF')

The posterior mean of sigma is about 0.4, somewhat higher than the standard deviation of the data.

In [25]:

posterior_sigma.mean(), np.std(data)

Out[25]:

(0.40440859064493323, 0.2687419249432851)
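Summing along the other axis gives the posterior distribution of mu. Here is a self-contained sketch that repeats the grid update with plain NumPy arrays instead of Pmf and DataFrame objects, which is a simplification for illustration. The names post_mu, post_sigma, mean_mu, and mean_sigma are mine.

```python
import numpy as np
from scipy.stats import norm, gamma

data = np.array([4.0, 3.9, 4.0, 4.0, 4.7, 4.2])
mus = np.linspace(2, 7, 51)
sigmas = np.linspace(0.01, 5, 101)

# Joint prior: rows correspond to mu, columns to sigma.
prior_mu = norm(4.45, 0.6).pdf(mus)
prior_sigma = gamma(2, scale=0.5).pdf(sigmas)
prior = np.outer(prior_mu, prior_sigma)

# Likelihood of the data for each (mu, sigma) pair.
MU, SIGMA, DATA = np.meshgrid(mus, sigmas, data, indexing='ij')
likelihood = norm(MU, SIGMA).pdf(DATA).prod(axis=2)

# Normalized joint posterior.
posterior = prior * likelihood
posterior /= posterior.sum()

# Summing over columns leaves mu; summing over rows leaves sigma.
post_mu = posterior.sum(axis=1)
post_sigma = posterior.sum(axis=0)

mean_mu = np.sum(mus * post_mu)
mean_sigma = np.sum(sigmas * post_sigma)
print(mean_mu, mean_sigma)
```

The posterior mean of mu falls between the sample mean and the prior mean, and closer to the sample mean.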

Predictions¶

The last part of OP’s question is “So, I’m wondering if there’s a way to estimate the range of possibilities of what I would see if I could give 100 or 1000 samples?”

We can simulate a future experiment with larger sample size by drawing random pairs from the joint posterior distribution and generating simulated measurements. That’s easier to do if we stack the posterior PMF.

In [26]:

posterior_pmf = Pmf(posterior.stack())
posterior_pmf.head()

Out[26]:

		probs
mu	sigma
2.0	0.0100	0.0
	0.0599	0.0
	0.1098	0.0

Now we can use NumPy’s choice function to choose a random pair of mu and sigma. For each random pair, we can generate 1000 simulated measurements from a normal distribution, and plot the distribution of the results.

In [27]:

for i in range(11):
    mu, sigma = np.random.choice(posterior_pmf.index, p=posterior_pmf)
    sample = np.random.normal(mu, sigma, size=1000)
    sns.kdeplot(sample, alpha=0.3)
    
decorate(xlabel='K+ (mmol/L)')

Each line shows the distribution of a possible sample of 1000 measurements. You can see that they vary in both location and spread, due to the uncertainty represented by the posterior distribution of mu and sigma.
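If we want a single range rather than a family of curves, one option is to draw many pairs from the posterior and generate one simulated measurement per pair. The result is a sample from the predictive distribution of a single future measurement, which is a slightly different quantity from the per-pair samples plotted above. This sketch rebuilds the posterior with plain NumPy so it can run on its own. The seed, the number of draws, and the 5th and 95th percentiles are arbitrary choices of mine.

```python
import numpy as np
from scipy.stats import norm, gamma

data = np.array([4.0, 3.9, 4.0, 4.0, 4.7, 4.2])
mus = np.linspace(2, 7, 51)
sigmas = np.linspace(0.01, 5, 101)

# Rebuild the joint posterior on the same grid as the notebook.
prior = np.outer(norm(4.45, 0.6).pdf(mus), gamma(2, scale=0.5).pdf(sigmas))
MU, SIGMA, DATA = np.meshgrid(mus, sigmas, data, indexing='ij')
posterior = prior * norm(MU, SIGMA).pdf(DATA).prod(axis=2)
posterior /= posterior.sum()

rng = np.random.default_rng(17)  # arbitrary seed, for reproducibility

# Draw grid cells in proportion to their posterior probability.
flat_index = rng.choice(posterior.size, size=10_000, p=posterior.ravel())
i, j = np.unravel_index(flat_index, posterior.shape)

# One simulated measurement per sampled (mu, sigma) pair.
predictive = rng.normal(mus[i], sigmas[j])

low, high = np.percentile(predictive, [5, 95])
print(low, high)
```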

Discussion¶

The normal distribution is probably a good enough model for this data, but for some pairs of mu and sigma in the prior, the normal distribution extends into negative values with non-negligible probability — and for measurements like these, negative values are nonsensical.

An alternative would be to run the whole calculation with the logarithms of the measurements. In effect, this would model the distribution of the data as lognormal, which is generally a good choice for measurements like these.
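As a sketch of the first step, here are the measurements on a log scale. To finish the analysis, the priors for mu and sigma would also have to be expressed on the log scale, which is not shown here. The variable names are mine.

```python
import numpy as np

data = np.array([4.0, 3.9, 4.0, 4.0, 4.7, 4.2])

# Work on the log scale, where a normal model corresponds to a
# lognormal model for the original measurements.
log_data = np.log(data)
log_mean, log_std = log_data.mean(), log_data.std()

# Back-transforming the mean of the logs gives the geometric mean,
# which is never larger than the arithmetic mean.
geometric_mean = np.exp(log_mean)
print(log_mean, log_std, geometric_mean)
```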

In this example, the sample size is small, so the choice of the prior is important. I suspect there is more background information we could use to make a better choice for the prior distribution of sigma.

This example demonstrates the use of grid methods for Bayesian statistics. For many common statistical questions, grid methods like this are simple and fast to compute, especially with libraries like NumPy and SciPy. And they make it possible to answer many useful probabilistic questions.

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

Destructive Testing

May 7, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

Sample Size Selection¶

Here’s a question from the Reddit statistics forum.

Hi Redditors, I am a civil engineer trying to solve a statistical problem for a current project I have. I have a pavement parking lot 125,000 sf in size. I performed nondestructive testing to render an opinion about the areas experiencing internal delimitation not observable from the surface. Based on preliminary testing, it was determined that 9% of the area is bad, and 11% of the total area I am unsure about (nonconclusive results if bad or good), and 80% of the area is good. I need to verify all areas using destructive testing, I will take out slabs 2 sf in size. my question is how many samples do I need to take from each area to confirm the results with 95% confidence interval?

There are elements of this question that are not clear, and OP did not respond to follow-up questions. But the question is generally about sample size selection, so let’s talk about that.

If the parking lot is 125,000 sf and each sample is 2 sf, we can imagine dividing the total area into 62,500 test patches. Of those, some unknown proportion are good and the rest are bad.

In reality, there is probably some spatial correlation — if a patch is bad, the nearby patches are more likely to be bad. But if we choose a sample of patches entirely at random, we can assume that they are independent. In that case, we can estimate the proportion of patches that are good and quantify the precision of that estimate by computing a confidence interval.
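Here is a minimal sketch of what choosing patches entirely at random could look like, assuming the patches are numbered from 0 to 62,499. The numbering, the sample size, and the seed are hypothetical, and mapping patch numbers to locations in the lot is not shown.

```python
import numpy as np

total_area = 125_000   # square feet, from the question
patch_area = 2         # square feet per destructive test
num_patches = total_area // patch_area

rng = np.random.default_rng(1)  # arbitrary seed

# Choose n distinct patches uniformly at random (no patch tested twice).
n = 100  # hypothetical sample size
chosen = rng.choice(num_patches, size=n, replace=False)

print(num_patches, np.sort(chosen)[:5])
```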

Then we can choose a sample size that meets some requirement. For example, we might want the 95% confidence interval to be narrower than a given threshold, or we might want to bound the probability that the proportion falls below some threshold.

But let’s start by estimating proportions and computing confidence intervals.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

The beta-binomial model¶

Based on preliminary testing, it is likely that the proportion of good patches is between 80% and 90%. We can take advantage of that information by using a beta distribution as a prior and updating it with the data. Here’s a prior distribution that seems like a reasonable choice, given the background information.

In [3]:

from scipy.stats import beta as beta_dist

prior = beta_dist(8, 2)
prior.mean()

Out[3]:

0.8

The prior mean is at the low end of the likely range, so the results will be a little conservative. Here’s what the prior looks like.

In [4]:

def plot_dist(dist, **options):
    qs = np.linspace(0, 1, 101)
    ps = dist.pdf(qs)
    ps /= ps.sum()
    plt.plot(qs, ps, **options)

In [5]:

plot_dist(prior, color='gray', label='prior')

decorate(xlabel='Proportion good',
         ylabel='PDF')

This prior leaves open the possibility of values below 80% and greater than 90%, but it assigns them lower probabilities.

Now let’s generate a hypothetical dataset to see what the update looks like. Suppose the actual percentage of good patches is 90%, and we sample n=10 of them.

In [6]:

def generate_data(n, p):
    yes = int(round(n * p))
    no = n - yes
    return yes, no

And suppose that, in line with expectations, 9 out of 10 tests are good.

In [7]:

yes, no = generate_data(n=10, p=0.9)
yes, no

Out[7]:

(9, 1)

Under the beta-binomial model, computing the posterior is easy.

In [8]:

def update(dist, yes, no):
    a, b = dist.args
    return beta_dist(a + yes, b + no)

Here’s how we run the update.

In [9]:

posterior10 = update(prior, yes, no)
posterior10.mean()

Out[9]:

0.85

The posterior mean is 85%, which is halfway between the prior mean and the proportion observed in the data.
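That is not a coincidence. With a beta prior, the posterior mean is a weighted average of the prior mean and the sample proportion, and the weight on the data is n / (a + b + n). Here the prior parameters add up to 10 and the sample size is 10, so the weights are equal. A small sketch of the arithmetic, with variable names of my choosing:

```python
# Prior Beta(a, b) and data with `yes` successes out of n trials.
a, b = 8, 2
yes, n = 9, 10

prior_mean = a / (a + b)
sample_proportion = yes / n

# Weight on the data grows with n; the prior's weight is fixed by a + b.
weight_data = n / (a + b + n)
weight_prior = 1 - weight_data

posterior_mean = weight_prior * prior_mean + weight_data * sample_proportion
print(weight_data, posterior_mean)
```

With a larger sample, the weight on the data approaches 1 and the prior matters less.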

Here’s what the posterior distribution looks like, compared to the prior.

In [10]:

plot_dist(prior, color='gray', label='prior')
plot_dist(posterior10, label='posterior10')

decorate(xlabel='Proportion good',
         ylabel='PDF')

Given the posterior distribution, we can use ppf, which computes the inverse CDF, to compute a confidence interval.

In [11]:

def confidence_interval(dist, percent=95):
    low = (100 - percent) / 200
    high = 1 - low
    ci = dist.ppf([low, high])
    return ci

Here’s the result for this example.

In [12]:

confidence_interval(posterior10)

Out[12]:

array([0.66862334, 0.96617375])

With a sample size of only 10, the confidence interval is still quite wide — that is, the estimate of the proportion is not precise.

More data?¶

Now let’s run the same analysis with a sample size of n=100.

In [13]:

yes, no = generate_data(n=100, p=0.9)
posterior100 = update(prior, yes, no)
posterior100.mean()

Out[13]:

0.8909090909090909

With a larger sample size, the posterior mean is closer to the proportion observed in the data. And the posterior distribution is narrower, which indicates greater precision.

In [14]:

plot_dist(prior, color='gray', label='prior')
plot_dist(posterior10, label='posterior10')
plot_dist(posterior100, label='posterior100')

decorate(xlabel='Proportion good',
         ylabel='PDF')

The confidence interval is much smaller.

In [15]:

confidence_interval(posterior100)

Out[15]:

array([0.82660267, 0.94180387])

If we need more precision than that, we can increase the sample size more. If we don’t need that much precision, we can decrease it.

With some math, we could compute the sample size algorithmically, but a simple alternative is to run this analysis with different sample sizes until we get the results we need.
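Here is one way that trial-and-error search might be automated. It is a sketch under the same assumptions as above (the Beta(8, 2) prior and a sample that contains exactly the expected number of good patches). The functions interval_width and find_sample_size, the step size, and the target width of 0.10 are hypothetical choices, not part of the original analysis.

```python
from scipy.stats import beta as beta_dist

def interval_width(n, p=0.9, a=8, b=2, percent=95):
    """Width of the central posterior interval for a hypothetical sample.

    Assumes the sample contains the expected number of good patches,
    as generate_data does above.
    """
    yes = int(round(n * p))
    no = n - yes
    tail = (100 - percent) / 200
    low, high = beta_dist(a + yes, b + no).ppf([tail, 1 - tail])
    return high - low

def find_sample_size(target_width, step=10, max_n=5000):
    """Smallest n (in multiples of step) whose interval is narrow enough."""
    for n in range(step, max_n + 1, step):
        if interval_width(n) < target_width:
            return n
    return None  # target not reachable within max_n

n_needed = find_sample_size(0.10)
print(n_needed, interval_width(n_needed))
```

A real sample will not contain exactly the expected number of good patches, so the width we actually get will vary around this value.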

But what about that prior?¶

Some people don’t like using Bayesian methods because they think it is more objective to ignore perfectly good background information, even in cases like this where it comes from preliminary testing that is clearly applicable.

To satisfy them, we can run the analysis again with a uniform prior, which is not actually more objective, but it seems to make people happy.

In [16]:

uniform_prior = beta_dist(1, 1)
uniform_prior.mean()

Out[16]:

0.5

The mean of the uniform prior is 50%, so it is more pessimistic. Here’s the update with n=10.

In [17]:

yes, no = generate_data(n=10, p=0.9)
uniform_posterior10 = update(uniform_prior, yes, no)
uniform_posterior10.mean()

Out[17]:

0.8333333333333334

Now let’s compare the posterior distributions with the informative prior and the uniform prior.

In [18]:

plot_dist(uniform_prior, color='gray', label='uniform prior')
plot_dist(posterior10, color='C1', label='posterior10')
plot_dist(uniform_posterior10, color='C4', label='uniform posterior10')

decorate(xlabel='Proportion good',
         ylabel='PDF')

With the informative prior, the posterior distribution is a little narrower — an estimate that uses background information is more precise.

Let’s do the same thing with n=100.

In [19]:

uniform_prior = beta_dist(1, 1)
yes, no = generate_data(n=100, p=0.9)
uniform_posterior100 = update(uniform_prior, yes, no)

In [20]:

plot_dist(uniform_prior, color='gray', label='uniform prior')
plot_dist(posterior100, color='C1', label='posterior100')
plot_dist(uniform_posterior100, color='C4', label='uniform posterior100')

decorate(xlabel='Proportion good',
         ylabel='PDF')

With a larger sample size, the choice of the prior has less effect — the posterior distributions are almost the same.
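To put numbers on that comparison, here is a sketch that computes the width of the 95% interval for all four posteriors. The helper function width and the variable names are mine. The beta parameters are the prior parameters plus the hypothetical counts used above.

```python
from scipy.stats import beta as beta_dist

def width(dist, percent=95):
    """Width of the central interval of a distribution."""
    tail = (100 - percent) / 200
    low, high = dist.ppf([tail, 1 - tail])
    return high - low

# Posteriors after 9 of 10 good patches.
informative10 = width(beta_dist(8 + 9, 2 + 1))
uniform10 = width(beta_dist(1 + 9, 1 + 1))

# Posteriors after 90 of 100 good patches.
informative100 = width(beta_dist(8 + 90, 2 + 10))
uniform100 = width(beta_dist(1 + 90, 1 + 10))

print(informative10, uniform10)
print(informative100, uniform100)
```

The gap between the two widths is much smaller with n=100 than with n=10.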

Discussion¶

Sample size analysis is a good thing to do when you are designing experiments, because it requires you to

Make a model of the data-generating process,
Generate hypothetical data, and
Specify ahead of time what analysis you plan to do.

It also gives you a preview of what the results might look like, so you can think about the requirements. If you do these things before running an experiment, you are likely to clarify your thinking, communicate better, and improve the data collection process and analysis.

Sample size analysis can also help you choose a sample size, but most of the time that’s determined by practical considerations, anyway. I mean, how many holes do you think they’ll let you put in that parking lot?

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

The mean of a Likert scale?

May 3, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

Likert scale analysis¶

Here’s a question from the Reddit statistics forum.

I have collected data regarding how individuals feel about a particular program. They reported their feelings on a scale of 1-5, with 1 being Strongly Disagree, 2 being Disagree, 3 being Neutral, 4 being Agree, and 5 being Strongly Agree.

I am looking to analyze the data for averages responses, but I see that a basic mean will not do the trick. I am looking for very simple statistical analysis on the data. Could someone help out regarding what I would do?

It sounds like OP has heard the advice that you should not compute the mean of values on a Likert scale. The Likert scale is ordinal, which means that the values are ordered, but it is not an interval scale, because the distances between successive points are not necessarily equal.

For example, if we imagine that “Neutral” maps to 0 on a number line and “Agree” maps to 1, it’s not clear where we should place “Strongly agree”. And an appropriate mapping might not be symmetric — for example, maybe the people who choose “Strongly agree” are content, but the people who choose “Strongly disagree” are angry. In an arithmetic mean, they would cancel each other out — but that might erase an important distinction.
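To make the concern concrete, here is a small sketch with made-up responses from two groups. Under equal spacing, one group has the higher mean. Under a different mapping that preserves the order of the options, the comparison flips. The data, the two mappings, and the function mean_under are all hypothetical.

```python
import numpy as np

# Hypothetical responses on a 1-5 scale for two groups.
group_a = [4, 4, 4, 4]   # everyone agrees
group_b = [2, 2, 5, 5]   # split between disagree and strongly agree

# Two order-preserving ways to place the five options on a number line.
equal_spacing = {1: -2, 2: -1, 3: 0, 4: 1, 5: 2}
stretched_top = {1: -2, 2: -1, 3: 0, 4: 1, 5: 4}  # "Strongly agree" far from "Agree"

def mean_under(mapping, responses):
    """Mean of the responses after mapping each option to a number."""
    return np.mean([mapping[r] for r in responses])

print(mean_under(equal_spacing, group_a), mean_under(equal_spacing, group_b))
print(mean_under(stretched_top, group_a), mean_under(stretched_top, group_b))
```

Both mappings respect the order of the options, so the data alone cannot tell us which comparison is right.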

Nevertheless, I think an absolute prohibition on computing means is too strong. I’ll show some examples where I think it’s a reasonable thing to do — but I’ll also suggest alternatives that might be better.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Political views¶

As an example, I’ll use data from the General Social Survey (GSS), which I have resampled to correct for stratified sampling.

In [3]:

download('https://github.com/AllenDowney/DataQnA/raw/main/data/gss_qna_extract.hdf')

Out[3]:

'gss_qna_extract.hdf'

In [4]:

gss = pd.read_hdf('gss_qna_extract.hdf', 'gss')

The variable we’ll start with is polviews, which contains responses to this question:

We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal–point 1–to extremely conservative–point 7. Where would you place yourself on this scale?

This is not a Likert scale, specifically, but it is ordinal. Here is the distribution of responses:

In [5]:

from utils import values

values(gss['polviews'])

Out[5]:

polviews
1.0     2095
2.0     7309
3.0     7799
4.0    24157
5.0     9816
6.0     9612
7.0     2145
NaN     9457
Name: count, dtype: int64

Here’s the mapping from numerical values to the options that were shown to respondents.

In [6]:

polviews_dict = {
    1: "Extremely liberal",
    2: "Liberal",
    3: "Slightly liberal",
    4: "Moderate",
    5: "Slightly conservative",
    6: "Conservative",
    7: "Extremely conservative",
}

As always, it’s a good idea to visualize the distribution before we compute any summary statistics. I’ll use Pmf from empiricaldist to compute the PMF of the values.

In [7]:

from empiricaldist import Pmf

pmf_polviews = Pmf.from_seq(gss['polviews'])

Here’s what it looks like.

In [8]:

labels = list(polviews_dict.values())
plt.barh(labels, pmf_polviews)
decorate(xlabel='PMF')

The modal value is “Moderate” and the distribution is roughly symmetric, so I think it’s reasonable to compute a mean. For example, suppose we want to know how self-reported political alignment has changed over time. We can compute the mean in each year like this:

In [9]:

mean_series = gss.groupby('year')['polviews'].mean()

And plot it like this.

In [10]:

from utils import plot_series_lowess

plot_series_lowess(mean_series, 'C2')
decorate(xlabel='Year',
         title='Mean political alignment')

I used LOWESS to plot a local regression line, which makes long-term trends easier to see. It looks like the center of mass trended toward conservative from the 1970s into the 1990s and has trended toward liberal since then.

When you compute any summary statistic, you lose information. To see what we might be missing, we can use a normalized cross-tabulation to compute the distribution of responses in each year.

In [11]:

xtab = pd.crosstab(gss['year'], gss['polviews'], normalize='index')
xtab.head()

Out[11]:

polviews	1.0	2.0	3.0	4.0	5.0	6.0	7.0
year
1974	0.021908	0.142049	0.149117	0.380212	0.157597	0.127915	0.021201
1975	0.040057	0.131617	0.148069	0.386266	0.145923	0.115880	0.032189
1976	0.021877	0.139732	0.123500	0.398024	0.147495	0.145378	0.023994
1977	0.025085	0.122712	0.145085	0.402712	0.164746	0.111186	0.028475
1978	0.014463	0.096419	0.175620	0.384986	0.182507	0.128788	0.017218

And we can use a heat map to visualize the results.

In [12]:

sns.heatmap(xtab.T, cmap='cividis_r')
plt.gca().invert_yaxis()

Based on the heat map, it seems like the general shape of the distribution has not changed much — so the mean is probably a good way to make comparisons over time.

However, it is hard to interpret the mean in absolute terms. For example, in the most recent data, the mean is about 4.1.

In [13]:

gss.query('year == 2022')['polviews'].mean()

Out[13]:

4.099445902595509

Since 4.0 maps to “Moderate”, we can say that the center of mass is slightly on the conservative side of moderate, but it’s hard to say what a difference of 0.1 means on this scale.

As an alternative, we could add up the percentage who identify as conservative or liberal, with or without an adverb.

In [14]:

con = xtab[[5, 6, 7]].sum(axis=1) * 100
lib = xtab[[1, 2, 3]].sum(axis=1) * 100

And plot those percentages over time.

In [15]:

plot_series_lowess(con, 'C3', label='Conservative')
plot_series_lowess(lib, 'C0', label='Liberal')

decorate(xlabel='Year',
         ylabel='Percent',
         title='Percent identifying as conservative or liberal')

Or we could plot the difference in percentage points.

In [16]:

diff = con - lib
plot_series_lowess(diff, 'C4')

decorate(xlabel='Year',
         ylabel='Percentage points',
         title='Difference %conservative - %liberal')

This figure shows the same trends we saw by plotting the mean, but the y-axis is more interpretable — for example, we could report that, at the peak of the Reagan era, conservatives outnumbered liberals by 10-15 percentage points.

Standard deviation¶

Suppose we are interested in polarization, so we want to see if the spread of the distribution has changed over time. Would it be OK to compute the standard deviation of the responses? As with the mean, my answer is yes and no.

First, let’s see what it looks like.

In [17]:

std_series = gss.groupby('year')['polviews'].std()

In [18]:

plot_series_lowess(std_series, 'C3')
decorate(xlabel='Year',
         ylabel='Standard deviation')

The standard deviation is easy to compute, and it makes it easy to see the long-term trend. If we interpret the spread of the distribution as a measure of polarization, it looks like it has increased in the last 30 years.

But it is not easy to interpret this result in context. If it increased from about 1.35 to 1.5, is that a lot? It’s hard to say.

As an alternative, let’s compute the mean absolute difference (MAD), which we can think of like this: if we choose two people at random, how much will they differ on this scale, on average?

A quick way to estimate MAD is to draw two samples from the responses and compute the mean pairwise distance.

In [19]:

def sample_mad(series, size=1000):
    data = series.dropna()
    if len(data) == 0:
        return np.nan
    sample1 = np.random.choice(data, size=size, replace=True)
    sample2 = np.random.choice(data, size=size, replace=True)
    mad = np.abs(sample1 - sample2).mean()
    return mad

In [20]:

sample_mad(gss['polviews'])

Out[20]:

1.515

The result is about 1.5 points, which is bigger than the distance from moderate to slightly conservative, and smaller than the distance from moderate to conservative.

Rather than sampling, we can compute MAD deterministically by forming the joint distribution of response pairs and computing the expected value of the distances. For this computation, it is convenient to use NumPy functions for outer product and outer difference.

In [21]:

def outer_mad(series):
    pmf = Pmf.from_seq(series)
    if len(pmf) == 0:
        return np.nan
    ps = np.outer(pmf, pmf)
    qs = np.abs(np.subtract.outer(pmf.index, pmf.index))
    return np.sum(ps * qs)

Again, the result is about 1.5 points.

In [22]:

outer_mad(gss['polviews'])

Out[22]:

1.5360753915376488

Now we can see how this value has changed over time.

In [23]:

mad_series = gss.groupby('year')['polviews'].apply(outer_mad)

Here’s the result, along with the standard deviation.

In [24]:

plt.figure(figsize=(6, 6))
plt.subplot(2, 1, 1)
plot_series_lowess(std_series, 'C3')
decorate(xlabel='',
         ylabel='Standard deviation')

plt.subplot(2, 1, 2)
plot_series_lowess(mad_series, 'C4')

decorate(xlabel='Year',
         ylabel='Mean absolute difference')

The two figures tell the same story — polarization is increasing. But the MAD is easier to interpret. In the 1970s, if you chose two people at random, they would differ by less than 1.5 points on average. Now the difference would be almost 1.7 points. Considering that the difference between a moderate and a conservative is 2 points, it seems like we should still be able to get along.

I think MAD is more interpretable than standard deviation, but it is based on the same assumption that the points on the scale are equally spaced. In most cases, that’s not an assumption we can easily check, but in this example, maybe we can.

For another project, I selected 15 questions in the GSS where conservatives and liberals are most likely to disagree, and used them to estimate the number of conservative responses from each respondent. The following figure shows the average number of conservative responses to the 15 questions for each point on the self-reported scale.

In [25]:

from utils import xticks

conservatism = gss.groupby('polviews')['conservatism'].mean()
conservatism.plot()
xticks(polviews_dict, rotation=30)
decorate(xlabel='Political alignment',
         ylabel='Conservative responses')

The result is close to a straight line, which suggests that the assumption of equal spacing is not bad in this case.

When is the mean bad?¶

In the examples so far, computing the mean and standard deviation of a scale variable is not necessarily the best choice, but it could be a reasonable choice. Now we’ll see an example where it is probably a bad choice.

The variable homosex contains responses to this question:

What about sexual relations between two adults of the same sex–do you think it is always wrong, almost always wrong, wrong only sometimes, or not wrong at all?

If the wording of the question seems loaded, remember that many of the core questions in the GSS were written in the 1970s. Here is the encoding of the responses.

In [26]:

homosex_dict = {
    1: "Always wrong",
    2: "Almost always wrong",
    3: "Sometimes wrong",
    4: "Not wrong at all",
    5: "Other",
}

And here are the value counts.

In [27]:

values(gss['homosex'])

Out[27]:

homosex
1.0    24856
2.0     1857
3.0     2909
4.0    12956
5.0       94
NaN    29718
Name: count, dtype: int64

Before we do anything else, let’s look at the distribution of responses.

In [28]:

pmf_homosex = Pmf.from_seq(gss['homosex'])

In [29]:

labels = list(homosex_dict.values())
plt.barh(labels, pmf_homosex)
decorate(xlabel='PMF')

There are several reasons it’s a bad idea to summarize this distribution by computing the mean. First, one of the responses is not ordered. If we include “Other” in the mean, the result is meaningless.

In [30]:

gss['homosex'].mean()         # total nonsense

Out[30]:

2.099526621672291

If we exclude “Other”, the remaining responses are ordered, but arguably not evenly spaced on a spectrum of opinion.

In [31]:

gss['homosex'].replace(5, np.nan).mean()         # still nonsense

Out[31]:

2.0931232091690544

If we compute a mean and report that the average response is somewhere between “Sometimes wrong” and “Almost always wrong”, that is not an effective summary of the distribution.
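To see why, here is a sketch that uses the value counts shown above, excluding “Other”, to compare the share of respondents near the mean with the share at the two extremes. The variable names are mine.

```python
import numpy as np

# Counts from the table above, excluding "Other".
values = np.array([1, 2, 3, 4])
counts = np.array([24856, 1857, 2909, 12956])

probs = counts / counts.sum()
mean = np.sum(values * probs)

# Share of respondents who chose one of the two options nearest the mean.
share_near_mean = probs[1] + probs[2]

# Share who chose one of the two extremes.
share_extremes = probs[0] + probs[3]

print(mean, share_near_mean, share_extremes)
```

The mean lands between the two options that the fewest respondents chose.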

The distribution of results is strongly bimodal — most people are either accepting of homosexuality or not. And that suggests a better way to summarize the distribution: we can simply report the fraction of respondents who choose one extreme or the other.

I’ll start by creating a binary variable that is 1 for respondents who chose “Not wrong at all”, 0 for the other responses, and NaN for people who were not asked the question, did not respond, or chose “Other”.

In [32]:

homosex_recode = {
    1: 0,
    2: 0,
    3: 0,
    4: 1,
    5: np.nan,
}

gss['homosex_recode'] = gss['homosex'].replace(homosex_recode)

In [33]:

values(gss['homosex_recode'])

Out[33]:

homosex_recode
0.0    29622
1.0    12956
NaN    29812
Name: count, dtype: int64

The mean of this variable is the fraction of respondents who chose “Not wrong at all”.

In [34]:

gss['homosex_recode'].mean()

Out[34]:

0.30428859974634787

Now we can see how this fraction has changed over time.

In [35]:

percent_series = gss.groupby('year')['homosex_recode'].mean() * 100

In [36]:

plot_series_lowess(percent_series, 'C4')

decorate(xlabel='Year',
         ylabel='Percent',
         title='Percent responding "Not wrong at all"')

The percentage of people who accept homosexuality was almost unchanged in the 1970s and 1980s, and began to increase quickly around 1990. For a discussion of this trend, and similar trends related to racism and sexism, you might be interested in Chapter 11 of Probably Overthinking It.

Discussion¶

Computing the mean of an ordinal variable can be a quick way to make comparisons between groups or show trends over time. The computation implicitly assumes that the points on the scale are equally spaced, which is not true in general, but in many cases it is close enough.

However, even when the mean (or standard deviation) is a reasonable choice, there is often an alternative that is easier to interpret in context.

It’s always good to look at the distribution before choosing a summary statistic. If it’s bimodal, the mean is probably not the best choice.

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

Appendix: Quantifying discord¶

We’ve seen that mean absolute difference (MAD) can quantify the average level of disagreement between two people in a population. As a generalization, it can also quantify the discord between two groups.

I might come back and expand this example later.

In [37]:

male = gss.query('sex == 1')
female = gss.query('sex == 2')

pmf_male = Pmf.from_seq(male['polviews'])
pmf_female = Pmf.from_seq(female['polviews'])

In [38]:

labels = list(polviews_dict.values())
plt.barh(labels, pmf_male, height=0.45, align='edge')
plt.barh(labels, pmf_female, height=-0.45, align='edge')
decorate(xlabel='PMF')

In [39]:

ps = np.outer(pmf_male, pmf_female)
qs = np.abs(np.subtract.outer(pmf_male.index, pmf_female.index))
np.sum(ps * qs)

Out[39]:

1.5404976377345188
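The computation above can be wrapped in a function. Here is a sketch that uses Pandas Series in place of Pmf objects, with two made-up groups on a three-point scale so the result is easy to check by hand. The function name discord and the example data are hypothetical.

```python
import numpy as np
import pandas as pd

def discord(pmf1, pmf2):
    """Expected absolute difference between one draw from each PMF.

    pmf1, pmf2: Series mapping scale values (index) to probabilities
    """
    # Probability of each pair of responses, one from each group.
    ps = np.outer(pmf1, pmf2)
    # Distance on the scale between each pair of responses.
    qs = np.abs(np.subtract.outer(pmf1.index.to_numpy(),
                                  pmf2.index.to_numpy()))
    return np.sum(ps * qs)

# Hypothetical groups on a 1-3 scale.
group1 = pd.Series([0.5, 0.5, 0.0], index=[1, 2, 3])
group2 = pd.Series([0.0, 0.5, 0.5], index=[1, 2, 3])

print(discord(group1, group2))   # between groups
print(discord(group1, group1))   # within group1
```

Comparing the between-group value to the within-group values shows whether two people from different groups disagree more, on average, than two people from the same group.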

In [40]:

color_map = {1: 'C0', 2: 'C1'}
sex_dict = {1: 'Male', 2: 'Female'}

for name, group in gss.groupby('sex'):
    series_mean = group.groupby('year')['polviews'].mean()
    plot_series_lowess(series_mean, color_map[name], label=sex_dict[name])
    
decorate(xlabel='Year',
         ylabel='Mean response (1-7)',
         title='Mean')

In [41]:

def compute_diff(df):
    xtab = pd.crosstab(df['year'], df['polviews'], normalize='index')
    lib = xtab[[1, 2, 3]].sum(axis=1) * 100
    con = xtab[[5, 6, 7]].sum(axis=1) * 100
    return con - lib

In [42]:

color_map = {1: 'C0', 2: 'C1'}
sex_dict = {1: 'Male', 2: 'Female'}

for name, group in gss.groupby('sex'):
    diff = compute_diff(group)
    plot_series_lowess(diff, color_map[name], label=sex_dict[name])
    
decorate(xlabel='Year',
         ylabel='Percentage points',
         title='Difference %conservative - %liberal')

Testing Percentiles

April 28, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

test_percentile

Testing percentiles¶

Here’s a question from the Reddit statistics forum.

I have two different samples (about 100 observations per sample) drawn from the same population (or that’s what I hypothesize; the populations may in fact be different). The samples and population are approximately normal in distribution.

I want to estimate the 85th percentile value for both samples, and then see if there is a statistically significant difference between these two values. I cannot use a normal z- or t-test for this, can I? It’s my current understanding that those tests would only work if I were comparing the means of the samples.

As an extension of this, say I wanted to compare one of these 85th percentile values to a fixed value; again, if I was looking at the mean, I would just construct a confidence interval and see if the fixed value fell within it…but the percentile stuff is throwing me for a loop.

This is […] related to a research project I’m working on (in my job).

There are two questions here. The first is about testing a difference in percentiles between two groups. The second is about the difference between a percentile from an observed sample and an expected value.

We’ll answer the first question with a permutation test, and we’ll answer the second in two ways: bootstrap resampling and a Gaussian model.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Data¶

Since OP didn’t provide a dataset, we have to generate one. I’ll draw two samples from Gaussian distributions with the same standard deviation and different means.

In [3]:

np.random.seed(17)

mu = 10
sigma = 2
size = 100

group1 = np.random.normal(mu, sigma, size=size)
group2 = np.random.normal(mu+1, sigma, size=size)

Here’s what the distributions of the groups look like.

In [4]:

sns.kdeplot(group1, label='Group1')
sns.kdeplot(group2, label='Group2')

decorate(xlabel='Quantity',
         ylabel='Density',
         title='Distributions of the data')

If we compute the 85th percentile in both groups, we see a difference, as expected.

In [5]:

stat1 = np.percentile(group1, 85)
actual_stat2 = np.percentile(group2, 85)
stat1, actual_stat2

Out[5]:

(12.241826987876475, 13.057003640622057)

In [6]:

actual_diff = actual_stat2 - stat1
actual_diff

Out[6]:

0.8151766527455813

Now let’s see if a difference of that size would be likely if the two samples were actually drawn from the same distribution.

Testing the Difference¶

When we test a difference between two groups, the usual model of the null hypothesis is that the groups are actually identical. If that’s true, the two samples came from the same distribution, so we can combine them into a single sample.

In [7]:

pooled = np.concatenate([group1, group2])
n, m = len(group1), len(group2)

Now we can simulate the null hypothesis by permutation — that is, by shuffling the pooled data and splitting it into two groups with the same sizes as the originals. The following function generates two samples under this assumption, and returns the difference in their 85th percentiles.

In [8]:

def simulate_percentile_difference():
    np.random.shuffle(pooled)
    shuffled1 = pooled[:n]
    shuffled2 = pooled[n:]
    diff = np.percentile(shuffled1, 85) - np.percentile(shuffled2, 85)
    return diff

If we call it many times, the result is a sample from the distribution of differences under the null hypothesis.

In [9]:

np.random.seed(19)
sample_diff = [simulate_percentile_difference() for i in range(1001)]

Here’s what it looks like, with a vertical line at the observed difference. I’m plotting it with cut=0 so the estimated density doesn’t extend past the minimum and maximum of the data.

In [10]:

sns.kdeplot(sample_diff, label='', cut=0)
plt.axvline(actual_diff, ls=':', color='gray')
decorate(xlabel='Difference',
         ylabel='Density',
         title='Distribution of differences under H0')

The distribution is multimodal, which is the result of selecting a moderately high percentile from a moderately small dataset — the diversity of the results is limited. However, in this example we are interested in the tails of the distribution, so multimodality is not a problem.

To estimate a one-sided p-value, we can compute the fraction of the sample that exceeds the actual difference.

In [11]:

p_value_one_sided = (sample_diff >= actual_diff).mean()
p_value_one_sided

Out[11]:

0.04195804195804196

Or, for a two-sided p-value, we can compute the fraction of the sample that exceeds the actual difference in absolute value.

In [12]:

p_value_two_sided = (np.abs(sample_diff) > actual_diff).mean()
p_value_two_sided

Out[12]:

0.07892107892107893

In this example, the result of the one-sided test would be considered significant at the 5% significance level, but the two-sided test would not. So which is it?

I think it’s not worth worrying about. My interpretation of the results is the same either way: they are inconclusive. Under the null hypothesis, a difference as big as the one we saw would be unlikely, but we can’t rule out the possibility that the groups are identical — or nearly so — and the apparent difference is due to random variation.
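For reuse, the steps in this section can be wrapped in one function. This is only a sketch: the function name, the default arguments, and the use of `np.random.default_rng` (instead of the global seed used above) are my choices for illustration, and it counts ties with `>=` in both the one-sided and two-sided versions.

```python
import numpy as np

def permutation_test_percentile(a, b, q=85, iters=1001, seed=0):
    """Permutation test for the difference in the q-th percentile (b minus a)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = np.percentile(b, q) - np.percentile(a, q)

    pooled = np.concatenate([a, b])
    n = len(a)
    diffs = np.empty(iters)
    for i in range(iters):
        shuffled = rng.permutation(pooled)      # shuffled copy; pooled is unchanged
        diffs[i] = np.percentile(shuffled[n:], q) - np.percentile(shuffled[:n], q)

    p_one_sided = np.mean(diffs >= observed)
    p_two_sided = np.mean(np.abs(diffs) >= abs(observed))
    return observed, p_one_sided, p_two_sided

# Hypothetical data, generated the same way as the groups above
rng = np.random.default_rng(1)
x = rng.normal(10, 2, size=100)
y = rng.normal(11, 2, size=100)
observed, p_one, p_two = permutation_test_percentile(x, y)
```

Unlike the version above, this one does not shuffle the pooled array in place, so the original data are left untouched between calls.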

Testing a fixed value¶

Now let’s turn to the second question. Suppose we have reason to think that the actual value of the 85th percentile is 12.3, and we would like to know whether the data contradict this hypothesis.

In [13]:

expected = 12.3

We’ll test group1 first. Here’s the 85th percentile of group1 and its difference from the expected value.

In [14]:

actual_stat1 = np.percentile(group1, 85)
actual_diff1 = actual_stat1 - expected
actual_stat1, actual_diff1

Out[14]:

(12.241826987876475, -0.058173012123525325)

Let’s see if a difference of this magnitude is likely to happen under the null hypothesis. One way to model the null hypothesis is to create a dataset that is similar to the observed data, but where the 85th percentile is exactly as expected. We can do that by shifting the observed data by the observed difference.

In [15]:

shifted = group1 - actual_diff1
np.percentile(shifted, 85)

Out[15]:

12.3

The 85th percentile of the shifted data is the expected value, exactly.
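The reason this works is that percentiles move along with the data: subtracting a constant from every value subtracts the same constant from every percentile, and leaves the spread unchanged. Here is a quick check on made-up data (the sample and the variable names are for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(10, 2, size=100)    # hypothetical sample
target = 12.3                         # hypothesized 85th percentile

shift = np.percentile(data, 85) - target
shifted = data - shift

# Shifting moves the percentile onto the target and leaves the spread alone
assert np.isclose(np.percentile(shifted, 85), target)
assert np.isclose(np.std(shifted), np.std(data))
```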

Now, to generate samples under the null hypothesis, we can use the following function, which takes a sample, shifts it to have the expected value of the 85th percentile, generates a bootstrap resample of the shifted values, and returns the difference between the 85th percentile of the sample and the expected value.

In [16]:

def bootstrap_percentile(group):
    stat = np.percentile(group, 85) - expected
    shifted = group - stat
    resampled = np.random.choice(shifted, size=len(group), replace=True)
    return np.percentile(resampled, 85) - expected

If we call this function many times, we get a sample of the differences we expect under the null hypothesis.

In [17]:

np.random.seed(17)
sample1 = [bootstrap_percentile(group1) for i in range(1001)]

The following figure shows the distribution of the sample with a vertical line at the observed value.

In [18]:

sns.kdeplot(sample1, label='Sampling distribution')
plt.axvline(actual_diff1, ls=':', color='gray')
decorate(xlabel='Deviation',
         ylabel='Density',
         title='Distribution of deviations from expected under H0')

Without computing a p-value, we can see that a difference as big as actual_diff1 is entirely plausible under the null hypothesis. We can confirm that by computing a one-sided p-value.

In [19]:

p_value_one_side = (sample1 < actual_diff1).mean()
p_value_one_side

Out[19]:

0.4405594405594406

So the observed difference in the first group is not statistically significant. Now let’s do the same thing for the second group.

In [20]:

actual_stat2 = np.percentile(group2, 85)
actual_diff2 = actual_stat2 - expected
actual_diff2

Out[20]:

0.7570036406220559

In [21]:

np.random.seed(17)
sample2 = [bootstrap_percentile(group2) for i in range(1001)]

Here’s what the distribution of differences looks like under the null hypothesis, with a vertical line at the observed value.

In [22]:

sns.kdeplot(sample2, label='Group 2', cut=0)
plt.axvline(actual_diff2, ls=':', color='gray')
decorate(xlabel='Deviation',
         ylabel='Density',
         title='Distribution of deviations from expected under H0')

There are no differences in the sample that exceed the observed value.

In [23]:

np.max(sample2), actual_diff2

Out[23]:

(0.6391539586773582, 0.7570036406220559)

We can conclude that a difference as big as that is very unlikely under the null hypothesis. There’s not much point in computing a p-value more precisely than that, but if it’s required, we can estimate it if we assume that the tail of the sampling distribution is roughly Gaussian. In that case, we can fit a KDE to the sampling distribution like this.

In [24]:

from scipy.stats import gaussian_kde

kde = gaussian_kde(sample2)

And use a Pmf object to approximate the estimated density.

In [25]:

from empiricaldist import Pmf

qs = np.linspace(-2, 2, 201)
ps = kde.evaluate(qs)
pmf = Pmf(ps, qs)
pmf.normalize()

Out[25]:

49.99999999999998

Then we can use the corresponding CDF to compute the probability of a value that exceeds the observed difference.

In [26]:

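cdf = pmf.make_cdf()
# Compare against actual_diff2, Group 2's deviation from the expected value.
# (Out[26] below was produced with actual_diff, which is slightly larger,
# so it slightly understates this tail probability.)
p_value = 1 - cdf(actual_diff2)
p_value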

Out[26]:

3.319666203671634e-05

So the p-value is quite small.
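As an alternative to discretizing the density, SciPy's `gaussian_kde` has an `integrate_box_1d` method that integrates the estimated density between two bounds. Here is a sketch of how the tail probability could be computed that way. The helper name and the symmetric stand-in sample are made up for illustration; this is not the computation that produced the number above.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_tail_probability(sample, threshold):
    """Probability, under a Gaussian KDE fit to sample, of exceeding threshold."""
    kde = gaussian_kde(sample)
    return kde.integrate_box_1d(threshold, np.inf)

# Hypothetical stand-in for a sampling distribution, symmetric around 0
rng = np.random.default_rng(0)
half = rng.normal(0, 0.2, size=500)
sim = np.concatenate([half, -half])

p_center = kde_tail_probability(sim, 0.0)   # half the area, by symmetry
p_tail = kde_tail_probability(sim, 0.5)     # a smaller area, farther out
```

Either way, the result depends on the assumption that the KDE, with its Gaussian kernels, describes the tail reasonably well beyond the range of the simulated values.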

Model-based resampling¶

The bootstrap method in the previous section is a good choice if we are unsure about the distribution of the data, or if there are outliers. But multiple modes in the sampling distribution suggest that there might not be enough diversity in the data for bootstrapping to be reliable.

Fortunately, there is another way we might model the null hypothesis: using a Gaussian distribution. If we generate data from a continuous distribution, we expect the sampling distribution to be unimodal.

But there is a problem we have to solve first — we have to make an assumption about the standard deviation of the hypothetical Gaussian distribution. One option is to use the standard deviation of the data.

In [27]:

s = np.std(group1)

Now we need to find a Gaussian distribution with a given standard deviation that has the expected 85th percentile. We can do that by starting with a distribution centered at 0, computing its 85th percentile, and then shifting it.

In [28]:

from scipy.stats import norm

dist0 = norm(0, s)
quantity = dist0.ppf(0.85)
quantity

Out[28]:

2.3232308032911324

ppf stands for “percent point function”, another name for the quantile function, which is the inverse of the CDF: it takes a cumulative probability and returns the corresponding quantity.

In [29]:

center = expected - quantity
dist = norm(center, s)
dist.ppf(0.85)

Out[29]:

12.3
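The two steps above can be combined into a small helper. This is a sketch with a made-up function name and a made-up standard deviation; it uses the fact that the p-quantile of a normal distribution is the mean plus the standard normal quantile times the standard deviation.

```python
from scipy.stats import norm

def gaussian_with_percentile(target, s, p=0.85):
    """Frozen normal distribution with standard deviation s
    whose p-quantile equals target."""
    z = norm.ppf(p)                   # p-quantile of the standard normal
    return norm(target - z * s, s)    # shift the center so the quantile lands on target

dist_h0 = gaussian_with_percentile(12.3, 2.0)   # s = 2.0 is a hypothetical value
```

Calling `dist_h0.ppf(0.85)` should return the target, and `dist_h0.cdf(12.3)` should return 0.85, which is a handy way to confirm that `ppf` and `cdf` are inverses.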

The following function takes one of the groups, fits a hypothetical model to it, generates a sample from the model, and returns the difference between the 85th percentile of the sample and the expected value.

In [30]:

def gaussian_percentile(group):
    s = np.std(group)
    dist0 = norm(0, s)
    quantity = dist0.ppf(0.85)
    center = expected - quantity
    dist = norm(center, s)
    sample = dist.rvs(size=len(group))
    return np.percentile(sample, 85) - expected

If we call this function many times, we get the sampling distribution of the test statistic under the null hypothesis.

In [31]:

np.random.seed(17)
sample3 = [gaussian_percentile(group1) for i in range(1001)]

Here’s what the distribution looks like, compared to the corresponding distribution from the bootstrapped model.

In [32]:

sns.kdeplot(sample1, label='Bootstrap model', cut=0)
sns.kdeplot(sample3, label='Gaussian model', cut=0)
plt.axvline(actual_diff1, ls=':', color='gray')
decorate(xlabel='Deviation',
         ylabel='Density',
         title='Distribution of deviations from expected under H0, Group 1')

The shapes of the distributions are different, but their ranges are comparable. And the conclusion is the same: a difference as big as actual_diff1 is entirely plausible under the null hypothesis.

In [33]:

p_value_one_side = (sample3 < actual_diff1).mean()
p_value_one_side

Out[33]:

0.48451548451548454

Now let’s try the same test with Group 2.

In [34]:

np.random.seed(17)
sample4 = [gaussian_percentile(group2) for i in range(1001)]

Here’s the result, along with the result from the bootstrap model.

In [35]:

sns.kdeplot(sample2, label='Bootstrap model', cut=0)
sns.kdeplot(sample4, label='Gaussian model', cut=0)
plt.axvline(actual_diff2, ls=':', color='gray')
decorate(xlabel='Deviation',
         ylabel='Density',
         title='Distribution of deviations from expected under H0, Group 2')

Again, the shapes of the distributions are different, but the conclusion is the same. A difference as big as actual_diff2 is unlikely under the null hypothesis.

In [36]:

p_value_one_side = (sample4 > actual_diff2).mean()
p_value_one_side

Out[36]:

0.003996003996003996

As usual, the two-sided p-value is bigger by a factor of two, roughly, but the difference never matters in practice.

In [37]:

p_value_two_sided = (np.abs(sample4) > actual_diff2).mean()
p_value_two_sided

Out[37]:

0.006993006993006993

Under this model of the null hypothesis, the probability is small that the 85th percentile of the data would exceed the expected value by so much.

Discussion¶

This example demonstrates a kind of inconsistency in hypothesis testing. We found that Group 1 is not significantly different from the expected value — in the technical sense of significantly — but Group 2 is. So that suggests that Group 1 and Group 2 are different from each other, but when we test that hypothesis, the difference is not statistically significant.

People who are new to hypothesis testing find results like this surprising, but they are not rare. Generally, they are a consequence of the logic of null hypothesis testing and the arbitrariness of the significance threshold.

I think it helps to interpret p-values qualitatively.

A p-value greater than 10% means that the observed effect is plausible under the null hypothesis, and could happen by chance.
A p-value less than 1% means that an observed effect is unlikely under the null hypothesis — so it is unlikely to have happened by chance.
Anything in between is inconclusive.

There is nothing special about 5%.
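To make the rule of thumb concrete, here it is as a tiny function. The function name and the wording of the labels are mine; the thresholds are the ones listed above.

```python
def interpret_p_value(p):
    """Qualitative reading of a p-value, using the 10% and 1% thresholds above."""
    if not 0 <= p <= 1:
        raise ValueError("a p-value must be between 0 and 1")
    if p > 0.10:
        return "plausible under H0"
    if p < 0.01:
        return "unlikely under H0"
    return "inconclusive"

# p-values from this example: Group 1 vs. expected, Group 2 vs. expected
# (Gaussian model), and the one-sided test of the difference between groups
labels = [interpret_p_value(p) for p in (0.44, 0.004, 0.042)]
```

Read this way, the three results are less contradictory than they first appear: one is plausible under the null hypothesis, one is unlikely, and the comparison between the groups falls in the inconclusive range.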

Small percentiles and missing data

April 26, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

low_percentile

Bootstrapping percentiles¶

Here’s a question from the Reddit statistics forum.

I’m trying to figure out how to determine the confidence interval for the .2 percentile temperature for specific set of observed temperatures (all hourly temperatures during January, February, and December since 2000). I have recordings for 53128 of the 53424 possible hourly recordings.

How would I go about saying that I am X% sure that the actual .2 percentile value is between two numbers? Could anyone provide any insight on how to accomplish this. Thank you.

OP provided a link to the data, so this is a question we can answer! For computing confidence intervals, my first choice is bootstrap resampling, but as it turns out, it does not work well for this problem. I’ll show what goes wrong and how to fix it. Then we’ll answer a follow-up question about quantifying the effect of missing data.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Data¶

I downloaded the data as a CSV file, which we can read into a Pandas DataFrame.

In [3]:

# download the data

download('https://github.com/AllenDowney/DataQnA/raw/main/data/temperature_data_ama.csv')

Out[3]:

'temperature_data_ama.csv'

In [4]:

col = 'Date/Time'
df = pd.read_csv('temperature_data_ama.csv', parse_dates=[col], index_col=col)
df.head()

Out[4]:

	tmpf
Date/Time
2000-01-01 00:00:00	NaN
2000-01-01 01:00:00	NaN
2000-01-01 02:00:00	NaN
2000-01-01 03:00:00	NaN
2000-01-01 04:00:00	NaN

There are 53424 measurements, of which 306 are missing.

In [5]:

len(df)

Out[5]:

53424

In [6]:

missing = df['tmpf'].isna()
missing.sum()

Out[6]:

306

The range of temperatures is from -9.9 degF to 89 degF.

In [37]:

data_clean = df['tmpf'].dropna()
data_clean.describe()

Out[37]:

count    53118.00000
mean        38.28920
std         13.71636
min         -9.90000
25%         29.00000
50%         37.00000
75%         47.00000
max         89.00000
Name: tmpf, dtype: float64

And the 0.2 percentile is 1 degF.

In [8]:

np.percentile(data_clean, 0.2)

Out[8]:

1.0

Basic bootstrap¶

The following function takes the cleaned data, resamples it, and computes the 0.2 percentile of the bootstrapped sample.

In [9]:

def bootstrap_percentile(data):
    resampled = np.random.choice(data, size=len(data), replace=True)
    return np.percentile(resampled, 0.2)

If we call this function 1001 times, we get a sample from the sampling distribution of the percentile.

In [10]:

np.random.seed(17)
sample = [bootstrap_percentile(data_clean) for i in range(1001)]

Here’s what that sample looks like.

In [11]:

sns.histplot(sample)
decorate(xlabel='')

Immediately we can see that something has gone wrong. The resampling process produces only 8 unique values.

In [12]:

np.unique(sample)

Out[12]:

array([0.    , 0.234 , 1.    , 1.0936, 1.2106, 1.4   , 1.517 , 1.9   ])

If we try to compute a CI by pulling percentiles from the sample, the results are not credible.

In [13]:

np.percentile(sample, [5, 95])

Out[13]:

array([1.    , 1.0936])

This example demonstrates a limitation of bootstrap resampling — it does not work well when there are a small number of unique values.

However, because the data are temperature measurements, they are actually continuous quantities. So one option is to replace bootstrapping with a model that generates continuous quantities. We’ll try that with a normal model, see that it does not work, and then try again with KDE.

Resampling from a normal model¶

If we look at the CDF of the data, it resembles the characteristic sigmoid shape of the normal CDF.

In [14]:

from empiricaldist import Cdf

cdf_data = Cdf.from_seq(data_clean)
cdf_data.plot(label='data')

decorate(xlabel='Temperature (degF)',
         ylabel='CDF')

So let’s see how it compares to a normal model. I’ll estimate the parameters by computing the mean and standard deviation of the data.

In [15]:

from scipy.stats import norm

mu = np.mean(data_clean)
sigma = np.std(data_clean)
dist = norm(mu, sigma)

And compute the normal CDF within 4 standard deviations of the mean.

In [16]:

low, high = mu - 4*sigma, mu + 4*sigma
xs = np.linspace(low, high, 201)
ys = dist.cdf(xs)

Here’s what the model looks like compared to the data.

In [17]:

plt.plot(xs, ys, color='gray', label='Normal model')
cdf_data.plot(label='data')

decorate(xlabel='Temperature (degF)',
         ylabel='CDF')

It looks pretty good, but there are places where the data clearly deviate from the model. That’s enough to make me worry, but let’s proceed and see how it goes.

The following function takes the cleaned data, generates a random sample from the normal model, and returns the 0.2 percentile of the sample.

In [18]:

def resample_percentile_norm(data):
    resampled = dist.rvs(len(data))
    return np.percentile(resampled, 0.2)

If we call it 1001 times, we hope the result is a sample from the sampling distribution of the percentile.

In [19]:

np.random.seed(17)
sample2 = [resample_percentile_norm(data_clean) for i in range(1001)]

And at first glance it looks good.

In [20]:

sns.kdeplot(sample2, label='Sampling distribution')
decorate(xlabel='Temperature (degF)',
         ylabel='Density')

But notice that the range of the sampling distribution does not include the 0.2 percentile of the data, which is 1. We can compute a 90% CI, but again, it is not credible.

In [21]:

ci90 = np.percentile(sample2, [5, 95])
ci90

Out[21]:

array([-1.84803327, -0.46534009])

To see what went wrong, let’s look at the normal model and the data again, this time with the y axis on a log scale. The log scale is like a microscope that lets us see more clearly what is happening in the tail of the distribution.

In [22]:

plt.plot(xs, ys, color='gray', label='Normal model')
cdf_data.plot(label='data')

decorate(xlabel='Temperature (degF)',
         ylabel='CDF',
         yscale='log')

On a linear scale, it seemed like the normal model might be good enough; on a log scale, it is clear that the data deviate from the model in the left tail.

In retrospect, it is not a surprise if a simple two-parameter model fails to capture every detail of the distribution — the world is a complicated place. So let’s try a nonparametric approach.

Resampling with KDE¶

We can use kernel density estimation (KDE) to model the distribution of the data, then use the model to resample. Here’s how we estimate the distribution.

In [23]:

from scipy.stats import gaussian_kde

kde = gaussian_kde(data_clean)

To see what the result looks like, we can approximate the density of the model with a discrete PMF.

In [24]:

from empiricaldist import Pmf

pmf_kde = Pmf(kde.pdf(xs), xs)
pmf_kde.normalize()

Out[24]:

1.8226580141477107

And then compare the CDF of the model with the CDF of the data.

In [25]:

pmf_kde.make_cdf().plot(color='gray', label='KDE model')
cdf_data.plot(label='data')
decorate(xlabel='Temperature (degF)',
         ylabel='CDF')

The result shows that KDE is doing what it is meant to do — fitting a continuous distribution to the data with minimal assumptions.
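One way to see why a KDE model helps here: drawing from a Gaussian KDE is like bootstrapping and then adding a little Gaussian noise to each resampled value, sometimes called a smoothed bootstrap. The sketch below spells that out. The function name and the rounded test data are made up for illustration, and the bandwidth formula reflects my understanding of how `gaussian_kde` scales its kernel for one-dimensional data.

```python
import numpy as np
from scipy.stats import gaussian_kde

def smoothed_bootstrap(data, size, rng):
    """Resample with replacement, then jitter each value with kernel-sized noise."""
    data = np.asarray(data, dtype=float)
    kde = gaussian_kde(data)
    bandwidth = kde.factor * np.std(data, ddof=1)    # kernel standard deviation
    picks = rng.choice(data, size=size, replace=True)
    return picks + rng.normal(0, bandwidth, size=size), bandwidth

rng = np.random.default_rng(0)
rounded = np.round(rng.normal(38, 14, size=2000))    # hypothetical data recorded in whole degrees
resampled, bandwidth = smoothed_bootstrap(rounded, len(rounded), rng)
```

The resampled values are no longer limited to the values that appear in the data, which is what the plain bootstrap was missing.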

The following function takes the cleaned data, uses the KDE model to generate a random sample, and returns the 0.2 percentile of the sample.

In [26]:

def resample_percentile_kde(data):
    resampled = kde.resample(len(data))
    return np.percentile(resampled, 0.2)

If we call it 1001 times, we hope once again that the result is a sample from the sampling distribution of the percentile.

In [27]:

np.random.seed(17)
sample3 = [resample_percentile_kde(data_clean) for i in range(1001)]

And this time we get a better result. The sampling distribution looks good, and it contains the actual percentile of the data.

In [28]:

sns.kdeplot(sample3, label='Sampling distribution')
decorate(xlabel='Temperature (degF)',
         ylabel='Density')

And the width of the 90% CI is plausible.

In [29]:

np.percentile(sample3, [5, 95])

Out[29]:

array([0.27006571, 1.18702612])

So with a couple of false starts, we have answered the original question. But it turns out there’s more.

Fill missing values¶

In a follow-up message, OP wrote:

Just in case it helps any, here’s what I’m ultimately trying to accomplish with this endeavor… I am trying to come up with a plausible way of demonstrating that the .2 percentile value (1 degF) that is derived from this data set is sufficiently representative of what the value would be if there were no missing data points (hourly readings) from the dataset.

OK, that’s a different question! However, the resampling framework can be extended naturally to estimate the effect of missing data. Here’s a function that takes the original data — including NaNs — and fills the missing values with a random selection of valid values. For historical reasons, this way of filling missing values is called “hot deck imputation”.

In [30]:

data_nan = df['tmpf']
valid = data_nan.dropna()
missing = data_nan.isna()

def fill_missing(data):
    filled = data.copy()
    filled[missing] = np.random.choice(valid, size=missing.sum(), replace=True)
    return filled

To test it, we can check that the result has no NaNs.

In [31]:

filled = fill_missing(data_nan)
filled.isna().sum()

Out[31]:

0

Now we can include fill_missing as part of the resampling pipeline. The following function takes the original data, fills missing values, generates a sample from a KDE model, and returns the 0.2 percentile of the sample.

In [32]:

def resample_percentile_kde_fill(data):
    filled = fill_missing(data)
    kde = gaussian_kde(filled)
    resampled = kde.resample(len(data))
    return np.percentile(resampled, 0.2)

If we call it many times, we get a sample from a distribution that represents the uncertainty of the estimate due to a combination of missing data and random sampling.

In [33]:

np.random.seed(17)
sample4 = [resample_percentile_kde_fill(data_nan) for i in range(1001)]

Here’s what the result looks like, compared to the sampling distribution from the previous section, which represents only uncertainty due to random sampling.

In [34]:

sns.kdeplot(sample3, label='Sampling distribution')
sns.kdeplot(sample4, label='Sampling distribution with fill')
decorate(xlabel='Temperature (degF)',
         ylabel='Density')

The difference does not seem substantial, and the CIs are similar.

In [35]:

np.percentile(sample3, [5, 95])

Out[35]:

array([0.27006571, 1.18702612])

In [36]:

np.percentile(sample4, [5, 95])

Out[36]:

array([0.29608271, 1.18784825])

We can conclude that missing data does not have much effect on the CI.

To estimate the effect more precisely, we could run this again with a sample size of 10,001 rather than 1001. But I won’t bother because with only 306 missing values out of 53,424, I did not expect the missing data to affect the results by much, and this result confirms it. Rather than estimate the CI more precisely, I would conclude that missing data is not a problem, and drop it.

Discussion¶

Normally I am quick to recommend bootstrap resampling because “it just works”. It makes almost no assumptions about the distribution of the data, and it is easy to extend to almost any statistic. But as this example shows, it is not infallible — the kryptonite of bootstrapping is lack of diversity in the data.

To diagnose this problem, it is a good idea to explore the sampling distribution. If bootstrapping goes well, the sampling distribution should have many unique values, and the range should contain the estimate computed directly from the data, usually close to the middle of the CI.
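Those checks are easy to automate. Here is a sketch of a helper that reports them; the function name, the dictionary keys, and the short example sample are made up for illustration.

```python
import numpy as np

def check_sampling_distribution(sample, estimate, ci=(5, 95)):
    """Report simple diagnostics for a resampled sampling distribution."""
    sample = np.asarray(sample, dtype=float)
    low, high = np.percentile(sample, ci)
    return {
        "n_unique": len(np.unique(sample)),
        "ci": (low, high),
        "range_contains_estimate": bool(sample.min() <= estimate <= sample.max()),
        "ci_contains_estimate": bool(low <= estimate <= high),
    }

# Hypothetical sample with only a few distinct values
report = check_sampling_distribution([0.0, 1.0, 1.0, 1.0936, 1.9], estimate=1.0)
```

A small `n_unique` relative to the number of resamples, or an estimate that falls outside the range, is a sign that the resampling model needs another look.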

If the results from bootstrapping fail these tests, think about other ways to model the data-generating process. If a parametric model fits the data well, you can use the data to estimate parameters and then use the model to generate simulated samples. Otherwise, consider a non-parametric approach like KDE.

For filling missing data, hot deck imputation ignores serial correlation and other statistical structure in the data, so the imputed values are likely to be unrealistic. But in this case that’s probably a feature, because the results overestimate the effect of missing data. As a result, we can make the argument, “Even if we assume that the missing data is highly variable, it has no substantial effect on the estimated percentile or the computed CI.”
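For reference, here is a self-contained sketch of hot deck imputation on a tiny made-up series. The function name and the example values are hypothetical; the logic follows the same idea as `fill_missing` above, but takes the random number generator as an argument instead of relying on global state.

```python
import numpy as np
import pandas as pd

def hot_deck_fill(series, rng):
    """Fill NaNs with values drawn at random (with replacement) from the valid values."""
    valid = series.dropna().to_numpy()
    missing = series.isna()
    filled = series.copy()
    filled[missing] = rng.choice(valid, size=missing.sum(), replace=True)
    return filled

temps = pd.Series([30.0, np.nan, 32.0, 35.0, np.nan, 28.0])   # hypothetical readings
filled_temps = hot_deck_fill(temps, np.random.default_rng(0))
```

Because each filled value is drawn without regard to its neighbors, a missing reading in a cold spell is as likely to be replaced by a warm value as a cold one, which is the sense in which the imputed values ignore serial correlation.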

If it were necessary to fill missing data with more realistic values, we could use a time series method like ARIMA or a Gaussian process.

What does “strength” mean?

April 21, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

corr_trend

What does “strength” mean?¶

Here’s a question from the Reddit statistics forum.

I am currently doing a uni assignment and one of my tasks is analysing the correlation between two variables. When I use the correlation function in Excel, it returns a correlation of -0.0377. When I use the same data to create a scatter plot, the trend line is positive. I need to identify the correlation strength and direction and thereby, I am confused by these opposing outcomes. Can somebody please explain why the correlation is showing as negative but the trend line is positive? What does this indicate in terms of the strength and direction of the relationship between the two variables?

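To answer the immediate question, correlation and the slope of a linear regression line always have the same sign. Mathematically, both are the covariance of x and y (the dot product of the two variables after centering, up to a constant) divided by a positive quantity, so they cannot disagree in sign.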
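Here is a quick numerical check of that claim on made-up data, using the identity that the least-squares slope equals the correlation times the ratio of the standard deviations. The data and variable names are for illustration only.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = -0.1 * x + rng.normal(size=50)    # hypothetical weak relationship

r = np.corrcoef(x, y)[0, 1]
slope = linregress(x, y).slope

# slope = r * std(y) / std(x), and the ratio of standard deviations is positive
assert np.isclose(slope, r * np.std(y) / np.std(x))
assert np.sign(slope) == np.sign(r)
```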

So there is something strange going on. It might be a simple error — for example, maybe the correlation and regression were based on different data. Or it might be that the trend computed by Excel is something other than linear regression. For example, a line that minimizes mean absolute error (MAE) rather than mean squared error (MSE) can have a slope with the opposite sign of the correlation.

Without more information it’s hard to be sure what’s going on, but for this example it might not matter. The computed correlation is negative but very small. If we fit a line (other than a regression line) to the same data and the slope is positive but similarly small, that is not necessarily inconsistent. Within statistical uncertainty, both are indistinguishable from zero.

OP also asks, “What does this indicate in terms of the strength and direction of the relationship between the two variables?” So let’s answer that question, too.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

Interpreting correlation and slope¶

When people talk about the strength of a relationship, they might mean correlation or they might mean the slope of a fitted line. But these measures of “strength” are not always consistent.

For example, suppose we are concerned about the health effects of weight gain, so we plot weight versus age from 20 to 50 years old. I’ll generate two fake datasets to demonstrate the point.

In [2]:

np.random.seed(18)
xs1 = np.linspace(20, 50)
ys1 = 75 + 0.02 * xs1 + np.random.normal(0, 0.15, len(xs1))

In [3]:

np.random.seed(18)
xs2 = np.linspace(20, 50)
ys2 = 65 + 0.2 * xs2 + np.random.normal(0, 3, len(xs2))

I used the same random seed to generate both, so they look similar, as we can see in these scatter plots.

In [4]:

from utils import underride

def text(x, y, string, **options):
    """Plot text using axis coordinates.
    """
    transform = plt.gca().transAxes
    options = underride(options, transform=transform, ha='left', va='top')
    plt.text(x, y, string, **options)

In [5]:

plt.plot(xs1, ys1, 'o', alpha=0.5)
text(0.05, 0.9, 'Fake dataset A')
decorate(xlabel='Age in years',
         ylabel='Weight in kg')

In [6]:

plt.plot(xs2, ys2, 'o', alpha=0.5)
text(0.05, 0.9, 'Fake dataset B')
decorate(xlabel='Age in years',
         ylabel='Weight in kg')

Nevertheless, they have substantially different correlations.

In [7]:

rho1 = np.corrcoef(xs1, ys1)[0][1]
rho1

Out[7]:

0.7579660563439401

In [8]:

rho2 = np.corrcoef(xs2, ys2)[0][1]
rho2

Out[8]:

0.4782776976576317

In the first dataset, the correlation is close to 0.75. In the second, it is close to 0.5. So we might think the first relationship is stronger.

But let’s look at the slopes of the regression lines. For the first dataset, the estimated slope is about 0.019 kilograms per year or about 0.56 kilograms over the 30-year range.

In [9]:

from scipy.stats import linregress

res1 = linregress(xs1, ys1)
res1.slope, res1.slope * 30

Out[9]:

(0.018821034903244386, 0.5646310470973316)

For the second dataset, the estimated slope is almost 10 times higher — about 0.18 kilograms per year or 5.3 kilograms per 30 years.

In [10]:

res2 = linregress(xs2, ys2)
res2.slope, res2.slope * 30

Out[10]:

(0.17642069806488855, 5.292620941946657)

According to the correlations, the first relationship is stronger. According to the slopes, the second relationship is stronger. So which is it? The answer depends on context.

In this example, the slope of the regression line indicates the magnitude of weight gain. If we are concerned about the health effects of weight gain, the second relationship is probably more important.

On the other hand, correlation indicates how well we can predict one value based on the other. If, for some reason, we are trying to guess someone’s weight, based on their age, the first relationship would be more important.
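One way to quantify "how well we can predict" is to compare the spread of the residuals around the fitted line to the spread of the weights themselves. The sketch below regenerates the two fake datasets with the same seed and parameters; the helper `residual_ratio` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from scipy.stats import linregress

def residual_ratio(xs, ys):
    """Ratio of residual spread to total spread; smaller means better prediction."""
    res = linregress(xs, ys)
    residuals = ys - (res.intercept + res.slope * xs)
    return np.std(residuals) / np.std(ys), res.rvalue

# Regenerate the two fake datasets with the same seed and parameters as above
np.random.seed(18)
xs1 = np.linspace(20, 50)
ys1 = 75 + 0.02 * xs1 + np.random.normal(0, 0.15, len(xs1))

np.random.seed(18)
xs2 = np.linspace(20, 50)
ys2 = 65 + 0.2 * xs2 + np.random.normal(0, 3, len(xs2))

for label, xs, ys in [('A', xs1, ys1), ('B', xs2, ys2)]:
    ratio, r = residual_ratio(xs, ys)
    # For a least-squares line, the ratio equals sqrt(1 - r**2)
    print(label, ratio, np.sqrt(1 - r**2))
```

The ratio is smaller for dataset A, which is the sense in which the higher correlation means better prediction: knowing someone's age removes a larger share of the uncertainty about their weight.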

Here are all the results in the same plot.

In [11]:

def make_plot(xs, ys, title):
    """Make a scatter plot with fitted line.
    """
    res = linregress(xs, ys)
    plt.plot(xs, ys, 'o', alpha=0.5)

    fx = np.array([xs.min(), xs.max()])
    fy = res.intercept + res.slope * fx
    plt.plot(fx, fy, '-')

    text(0.05, 0.9, title)
    text(0.05, 0.82, f'correlation = {res.rvalue:0.2f}')
    text(0.05, 0.74, f'slope = {res.slope:0.3f} kg/yr')
    decorate(xlabel='Age in years',
             ylabel='Weight in kg')

In [12]:

plt.figure(figsize=(6, 7))

plt.subplot(2, 1, 1)
make_plot(xs1, ys1, 'Fake dataset A')

plt.subplot(2, 1, 2)
make_plot(xs2, ys2, 'Fake dataset B')

Because of the way the plots are scaled, the slope looks smaller in the second figure, but that’s misleading. So this example is a reminder to look at the labels of the y axis — which is where the effect size often hides.

Minimizing MAE¶

Earlier I said a line that minimizes mean absolute error (MAE) rather than mean squared error (MSE) can have a slope with the opposite sign of the correlation. To demonstrate, I’ll use the following function to minimize MAE.

In [13]:

from scipy.optimize import minimize

def error_func(params, xs, ys):
    intercept, slope = params
    y_pred = intercept + slope * xs
    return np.mean(np.abs(y_pred - ys))

def minimize_mae(xs, ys):
    param0 = [0, 0]
    result = minimize(error_func, param0, args=(xs, ys), method='Nelder-Mead')
    assert result.success
    
    return result.x

Now I’ll generate a dataset where xs and ys are actually uncorrelated.

In [14]:

n = 100

np.random.seed(20)
xs = np.random.normal(0, 1, n)
ys = np.random.normal(0, 1, n)

In this dataset, the correlation is slightly negative and the slope of the fitted line is slightly positive.

In [15]:

corr = np.corrcoef(xs, ys)[0, 1]
intercept, slope = minimize_mae(xs, ys)

corr, slope

Out[15]:

(-0.08198650127894906, 0.04675271007547886)

Here’s what the scatter plot looks like with the minimum MAE line.

In [16]:

fxs = np.array([np.min(xs), np.max(xs)])
fys = intercept + slope * fxs

In [17]:

plt.plot(xs, ys, '.')
plt.plot(fxs, fys)
decorate()

To find this example, I generated datasets with different random number seeds. Out of the first 100 attempts, 19 yield correlation and slope with opposite signs.

In [18]:

count = 0
for i in range(100):
    np.random.seed(i)
    xs = np.random.normal(0, 1, n)
    ys = np.random.normal(0, 1, n)
    corr = np.corrcoef(xs, ys)[0, 1]
    intercept, slope = minimize_mae(xs, ys)
    if corr * slope < 0:
        count += 1
count

Out[18]:

19

So examples like this are not rare, if the actual correlation is close to zero.
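For contrast, here is a sketch of the same loop using ordinary least squares instead of the minimum-MAE line. It uses `linregress` from SciPy with the same seeds and distributions; `count_ols` is a name introduced here.

```python
import numpy as np
from scipy.stats import linregress

n = 100
count_ols = 0
for i in range(100):
    # Same seeds and distributions as the loop above
    np.random.seed(i)
    xs = np.random.normal(0, 1, n)
    ys = np.random.normal(0, 1, n)
    res = linregress(xs, ys)
    # res.rvalue is the correlation; res.slope is the least-squares slope
    if res.rvalue * res.slope < 0:
        count_ols += 1

print(count_ols)
```

The count is zero, because the least-squares slope is the correlation multiplied by a ratio of standard deviations. So a sign mismatch points to a fitting method other than least squares, or to different data.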

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

What does a confidence interval mean?

April 17, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. In general, I will try to focus on practical problems, but this one is a little more philosophical.

What does a confidence interval mean?¶

Here’s a question from the Reddit statistics forum (with an edit for clarity):

Why does a confidence interval not tell you that 90% of the time, [the true value of the population parameter] will be in the interval, or something along those lines?

I understand that the interpretation of confidence intervals is that with repeated samples from the population, 90% of the time the interval would contain the true value of whatever it is you’re estimating. What I don’t understand is why this method doesn’t really tell you anything about what that parameter value is.

This is, to put it mildly, a common source of confusion. And here is one of the responses:

From a frequentist perspective, the true value of the parameter is fixed. Thus, once you have calculated your confidence interval, one of two things is true: either the true parameter value is inside the interval, or it is outside it. So the probability that the interval contains the true value is either 0 or 1, but you can never know which.

This response is the conventional answer to this question — it is what you find in most textbooks and what is taught in most classes. And, in my opinion, it is wrong. To explain why, I’ll start with a story.

Suppose Frank and Betsy visit a factory where 90% of the widgets are good and 10% are defective. Frank chooses a part at random and asks Betsy, “What is the probability that this part is good?”

Betsy says, “If 90% of the parts are good, and you choose one at random, the probability is 90% that it is good.”

“Wrong!” says Frank. “Since the part has already been manufactured, one of two things must be true: either it is good or it is defective. So the probability is either 100% or 0%, but we don’t know which.”

Frank’s argument is based on a strict interpretation of frequentism, which is a particular philosophy of probability. But it is not the only interpretation, and it is not a particularly good one. In fact, it suffers from several flaws. This example shows one of them — in many real-world scenarios where it would be meaningful and useful to assign a probability to a proposition, frequentism simply refuses to do so.

Fortunately, Betsy is under no obligation to adopt Frank’s interpretation of probability. She is free to adopt any of several alternatives that are consistent with her commonsense claim that a randomly-chosen part has a 90% probability of being good.

Now let’s see how this story relates to confidence intervals.

Click here to run this notebook on Colab

I’ll start by importing the usual libraries.

In [1]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Generating a confidence interval¶

Suppose that Frank is a statistics teacher and Betsy is one of his students. One day Frank teaches the class a process for computing confidence intervals that goes like this:

1. Collect a sample of size $n$.
2. Compute the sample mean, $m$, and the sample standard deviation, $s$.
3. If those estimates are correct, the sampling distribution of the mean is a normal distribution with mean $m$ and standard deviation $s / \sqrt{n}$.
4. Compute the 5th and 95th percentiles of this sampling distribution. The result is a 90% confidence interval.

As an example, Frank generates a sample with size 100 from a normal distribution with known parameters mean $\mu=10$ and standard deviation $\sigma=3$.

In [2]:

from scipy.stats import norm

mu = 10
sigma = 3

np.random.seed(17)
data = norm.rvs(mu, sigma, size=100)

Then Betsy uses the following function to compute a 90% CI.

In [3]:

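def compute_ci(data):
    """Compute a 90% CI for the mean using a normal approximation."""
    n = len(data)
    m = np.mean(data)
    s = np.std(data)  # NumPy default ddof=0 (population formula)
    sampling_dist = norm(m, s / np.sqrt(n))
    ci90 = sampling_dist.ppf([0.05, 0.95])  # 5th and 95th percentiles
    return ci90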

In [4]:

ci90 = compute_ci(data)
ci90

Out[4]:

array([ 9.78147291, 10.88758585])

In this example, we know that the actual population mean is 10 so we can see that this CI contains the population mean. But if we draw another sample, we might get a sample mean that is substantially higher or lower than $\mu$, and the CI we compute might not contain $\mu$.
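As a variation, and not the process Frank teaches, here is a sketch of the same interval computed with the t distribution and the `ddof=1` sample standard deviation. Both are swapped-in choices, shown only for comparison with the normal-based interval.

```python
import numpy as np
from scipy.stats import norm, t

# Same parameters and seed as Frank's example above
mu, sigma = 10, 3
np.random.seed(17)
data = norm.rvs(mu, sigma, size=100)

n = len(data)
m = np.mean(data)
s = np.std(data, ddof=1)        # ddof=1: the usual sample standard deviation
sem = s / np.sqrt(n)            # estimated standard error of the mean

# 90% interval from the t distribution with n - 1 degrees of freedom
ci90_t = t.interval(0.90, n - 1, loc=m, scale=sem)

# For comparison, the normal-based interval with the same m and sem
ci90_norm = norm.interval(0.90, loc=m, scale=sem)
print(ci90_t, ci90_norm)
```

With a sample of 100 the t-based interval is only slightly wider, so the choice makes little practical difference here. It matters more for small samples.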

To see how often that happens, we’ll use this function, which generates a sample, computes a 90% CI, and checks whether the CI contains $\mu$.

In [5]:

def run_experiment(mu, sigma):
    data = norm.rvs(mu, sigma, size=100)
    low, high = compute_ci(data)
    return low < mu < high

If we run this function 1000 times, we can count how often the CI contains $\mu$.

In [6]:

np.mean([run_experiment(mu, sigma) for i in range(1000)]) * 100

Out[6]:

90.60000000000001

The answer is close to 90% — that is, if we run this process many times, 90% of the CIs it generates contain $\mu$ and 10% don’t. So the CI-computing process is like a factory where 90% of the widgets are good and 10% are defective.

Now suppose Frank chooses a different value of $\mu$ and does not tell Betsy what it is. To simulate that scenario, I’ll choose a value from a random number generator with a specific seed.

In [7]:

np.random.seed(17)
unknown_mu = np.random.uniform(10, 20)

And just for good measure, I’ll generate a random value for $\sigma$, too.

In [8]:

unknown_sigma = np.random.uniform(2, 3)

Next Frank generates a sample from a normal distribution with those parameters, and gives the sample to Betsy.

In [9]:

data2 = norm.rvs(unknown_mu, unknown_sigma, size=100)

And Betsy uses the data to compute a CI.

In [10]:

compute_ci(data2)

Out[10]:

array([12.81278165, 13.73152148])
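The coverage of this process does not depend on knowing the parameters. The sketch below repeats the scenario many times, drawing a new "unknown" mean and standard deviation each time from the same ranges used above. The seed, the number of runs, and the helper name `ci90_normal` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

def ci90_normal(data):
    """Normal-approximation 90% CI, following the same steps as compute_ci above."""
    m = np.mean(data)
    s = np.std(data)
    return norm.ppf([0.05, 0.95], loc=m, scale=s / np.sqrt(len(data)))

rng = np.random.default_rng(0)   # arbitrary seed, for illustration
hits = 0
runs = 2000
for _ in range(runs):
    # New "unknown" parameters each time, from the same ranges as above
    mu_i = rng.uniform(10, 20)
    sigma_i = rng.uniform(2, 3)
    sample = rng.normal(mu_i, sigma_i, size=100)
    low, high = ci90_normal(sample)
    hits += low < mu_i < high

coverage = hits / runs
print(coverage)
```

The fraction of intervals that contain the hidden mean should again be close to 90%, whatever values Frank happens to choose.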

Now suppose Frank asks, “What is the probability that this CI contains the actual value of $\mu$ that I chose?”

Betsy says, “We have established that 90% of the CIs generated by this process contain $\mu$, so the probability that this CI contains $\mu$ is 90%.”

And of course Frank says “Wrong! Now that we have computed the CI, it is unknown whether it contains the true parameter, but it is not random. The probability that it contains $\mu$ is either 100% or 0%. We can’t say it has a 90% chance of containing $\mu$.”

Once again, Frank is asserting a particular interpretation of probability — one that has the regrettable property of rendering probability nearly useless. Fortunately, Betsy is under no obligation to join Frank’s cult.

Under most reasonable interpretations of probability, you can say that a specific 90% CI has a 90% chance of containing the true parameter. There is no real philosophical problem with that.

But there might be practical problems.

Practical problems¶

The process we use to construct a CI takes into account variability due to random sampling, but it does not take into account other problems, like measurement error and non-representative sampling. To see why that matters, let’s consider a more realistic example.

Suppose we want to estimate the average height of adult male residents of the United States. If we define terms like “height”, “adult”, “male”, and “resident of the United States” precisely enough, we have defined a population that has a true, unknown average height. If we collect a representative sample from the population and measure their heights, we can use the sample mean to estimate the population mean and compute a confidence interval.

To demonstrate, I’ll use data from the Behavioral Risk Factor Surveillance System (BRFSS). Here’s an extract I prepared for Elements of Data Science, based on BRFSS data from 2021.

In [11]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/ElementsOfDataScience/raw/v1/data/brfss_2021.hdf')

Out[11]:

'brfss_2021.hdf'

In [12]:

brfss = pd.read_hdf('brfss_2021.hdf', 'brfss')

It includes data from 203,760 male respondents.

In [13]:

male = brfss.query('_SEX == 1')
len(male)

Out[13]:

203760

For 193,701 of them, we have their self-reported height recorded in centimeters.

In [14]:

male['HTM4'].count()

Out[14]:

193701

We can use this data to compute a sample mean and 90% confidence interval.

In [15]:

m = male['HTM4'].mean()
ci90 = compute_ci(male['HTM4'])
m, ci90

Out[15]:

(178.14807357731763, array([178.11896943, 178.17717773]))

Because the sample size is so large, the confidence interval is quite small — its width is only 0.03% of the estimate.

In [16]:

np.diff(ci90) / m * 100

Out[16]:

array([0.03267411])

So there is very little variability in this estimate due to random sampling. That means the estimate is precise, but that doesn’t mean it’s accurate.

For one thing, the measurements in this dataset are self-reported. If people tend to round up — and they do — that would make the estimated mean too high.

For another thing, it is difficult to construct a representative sample of a population as large as the United States. The BRFSS is run by people who know what they are doing, but nothing is perfect — it is likely that some groups are systematically overrepresented or underrepresented. And that could make the estimated mean too high or too low.

Given that there is almost certainly some measurement error and some sampling bias, it is unlikely that the actual population mean falls in the very small confidence interval we computed.

And that’s true in general: when the sample size is large, variability due to random sampling is small, which means that other sources of error are likely to be bigger. So as sample size increases, the probability that the CI contains the true value decreases.
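A small simulation can illustrate this effect. In the sketch below, every measurement is shifted by a constant error, and we count how often the 90% CI still contains the true mean. The values of `mu`, `sigma`, and `bias` are hypothetical, chosen only for illustration; they are not estimates from the BRFSS.

```python
import numpy as np
from scipy.stats import norm

def coverage_with_bias(n, bias, mu=178, sigma=8, runs=1000, seed=0):
    """Fraction of 90% CIs that contain mu when every measurement is shifted by `bias`.

    mu, sigma, and bias are hypothetical values chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    # Each row is one sample of size n, with a constant measurement error added
    samples = rng.normal(mu, sigma, size=(runs, n)) + bias
    m = samples.mean(axis=1)
    sem = samples.std(axis=1) / np.sqrt(n)
    z = norm.ppf(0.95)
    low, high = m - z * sem, m + z * sem
    return np.mean((low < mu) & (mu < high))

for n in [100, 400, 1600]:
    print(n, coverage_with_bias(n, bias=0.5))
```

As the sample size grows, the interval shrinks around the biased estimate, so the fraction of intervals that contain the true mean falls. With `bias=0`, the same function should give coverage close to 90% at every sample size.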

Summary¶

The way confidence intervals are taught in most statistics classes is based on the frequentist interpretation of probability. But you are not obligated to adopt that interpretation, and there are good reasons you should not.

Some people will say that confidence intervals are a frequentist method that is inextricable from the frequentist interpretation. I don’t think that’s true — there is nothing about the computation of a confidence interval that depends on the frequentist interpretation. So you are free to interpret the CI under any philosophy of probability you like.

If you want to say that a 90% CI has a 90% chance of containing the true value, there is nothing wrong with that, philosophically. I think it is a meaningful and useful probabilistic claim.

However, it is only true if other sources of error — like sampling bias and measurement error — are small compared to variability due to random sampling.

For that reason, I think the best interpretation of a confidence interval, for practical purposes, is that it quantifies the precision of the estimate but says nothing about its accuracy.

Credit: I borrowed Frank and Betsy from my friend Ted Bunn. They first appeared in his blog post “Who knows what evil lurks in the hearts of men? The Bayesian doesn’t care.”