Probably the Book

August 23, 2024 AllenDowney

Last week I had the pleasure of presenting a keynote at posit::conf(2024). When the video is available, I will post it here [UPDATE here it is].

In the meantime, you can read the slides, if you don’t mind spoilers.

For people at the conference who don’t know me, this might be a good time to introduce you to this blog, where I write about data science and Bayesian statistics, and to Probably Overthinking It, the book based on the blog, which was published by University of Chicago Press last December. Here’s an outline of the book with links to excerpts I’ve published in the blog and talks I’ve presented based on some of the chapters.

For your very own copy, you can order from Bookshop.org if you want to support independent bookstores, or Amazon if you don’t.

Twelve Excellent Chapters

In Chapter 1, we learn that no one is normal, everyone is weird, and everyone is about the same amount of weird. I published an excerpt from this chapter, and talked about it during this section of the SuperDataScience podcast. And it is featured in an interactive article at Brilliant.org, which includes this animation showing how measurements are distributed in multiple dimensions.

Chapter 2 is about the inspection paradox, which affects our perception of many real-world scenarios, including fun examples like class sizes and relay races, and more serious examples like our understanding of criminal justice and ability to track infectious disease. I published a prototype of this chapter as an article called “The Inspection Paradox is Everywhere”, and gave a talk about it at PyData NYC:

Chapter 3 presents three consequences of the inspection paradox in demography, especially changes in fertility in the United States over the last 50 years. It explains Preston’s paradox, named after the demographer who discovered it: if each woman has the same number of children as her mother, family sizes — and population — grow quickly; in order to maintain constant family sizes, women must have fewer children than their mothers, on average. I published an excerpt from this chapter, and it was discussed on Hacker News.
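To see why family sizes grow in that scenario, here is a minimal sketch. The family-size distribution is made up for illustration and is not taken from the book. The assumption is that every daughter has exactly as many children as her mother did, so mothers of large families are overrepresented in the next generation.

```python
import numpy as np

# Hypothetical distribution of completed family size (0 to 4 children).
sizes = np.arange(5)
pmf = np.array([0.2, 0.2, 0.3, 0.2, 0.1])


def next_generation(pmf, sizes):
    """Family-size distribution if each daughter copies her mother.

    A mother with i children contributes daughters in proportion to i,
    so the new distribution is the size-biased version of the old one.
    """
    weighted = pmf * sizes
    return weighted / weighted.sum()


mean_mothers = np.sum(pmf * sizes)
pmf_daughters = next_generation(pmf, sizes)
mean_daughters = np.sum(pmf_daughters * sizes)

print(f"mothers: {mean_mothers:.2f}, daughters: {mean_daughters:.2f}")
```

With these made-up numbers the mean goes from 1.8 to about 2.67 in one generation, which is why women have to have fewer children than their mothers, on average, just to keep family sizes constant.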

Chapter 4 is about extremes, outliers, and GOATs (greatest of all time), and two reasons the distribution of many abilities tends toward a lognormal distribution: proportional gain and weakest link effects. I gave a talk about this chapter for PyData Global 2023:

Chapter 5 is about the surprising conditions where something used is better than something new. Most things wear out over time, but sometimes longevity implies information, which implies even greater longevity. This property has implications for life expectancy and the possibility of much longer life spans. I gave a talk about this chapter at ODSC East 2024 — there’s no recording, but the slides are here.

Chapter 6 introduces Berkson’s paradox — a form of collision bias — with some simple examples like the correlation of test scores and some more important examples like COVID and depression. Chapter 7 uses collision bias to explain the low birthweight paradox and other confusing results from epidemiology. I gave a “Talk at Google” about these chapters:

Chapter 8 shows that the magnitudes of natural and human-caused disasters follow long-tailed distributions that violate our intuition, defy prediction, and leave us unprepared. Examples include earthquakes, solar flares, asteroid impacts, and stock market crashes. I gave a talk about this chapter at SciPy 2023:

The talk includes this animation showing how plotting a tail distribution on a log-y scale provides a clearer picture of the extreme tail behavior.

Chapter 9 is about the base rate fallacy, which is the cause of many statistical errors, including misinterpretations of medical tests, field sobriety tests, and COVID statistics. It includes a discussion of the COMPAS system for predicting criminal behavior.

Chapter 10 is about Simpson’s paradox, with examples from ecology, sociology, and economics. It is the key to understanding one of the most notorious examples of misinterpretation of COVID data. This is the first of three chapters that use data from the General Social Survey (GSS).

Chapter 11 is about the expansion of the Moral Circle, specifically changes in attitudes about race, gender, and homosexuality in the U.S. over the last 50 years. I published an excerpt about the remarkable decline of homophobia since 1990, featuring lyrics from “A Message From the Gay Community”.

Chapter 12 is about the Overton Paradox, a name I’ve given to a pattern observed in GSS data: as people get older, their beliefs become more liberal, on average, but they are more likely to say they are conservative. This chapter is the basis of this interactive lesson at Brilliant.org. And I gave a talk about it at PyData NYC 2022:

There are still a few chapters I haven’t given a talk about, so watch this space!

Again, you can order the book from Bookshop.org if you want to support independent bookstores, or Amazon if you don’t.

Supporting code for the book is in this GitHub repository. All of the chapters are available as Jupyter notebooks that run in Colab, so you can replicate my analysis. If you are teaching a data science or statistics class, they make good teaching examples.

Chapter 1: Are You Normal? Hint: No.

Run the code on Colab

Run the code that prepares the BRFSS data

Run the code that prepares the Big Five data

Chapter 2: Relay Races and Revolving Doors

Run the code on Colab

Chapter 3: Defy Tradition, Save the World

Run the code on Colab

Chapter 4: Extremes, Outliers, and GOATs

Run the code on Colab

Run the code that prepares the BRFSS data

Run the code that prepares the NSFG data

Chapter 5: Better Than New

Run the code on Colab

Chapter 6: Jumping to Conclusions

Run the code on Colab

Chapter 7: Causation, Collision, and Confusion

Run the code on Colab

Run the code that prepares the NCHS data

Chapter 8: The Long Tail of Disaster

Run the code on Colab

Run the code that prepares the earthquake data

Run the code that prepares the solar flare data

Chapter 9: Fairness and Fallacy

Run the code on Colab

Chapter 10: Penguins, Pessimists, and Paradoxes

Run the code on Colab

Run the code that prepares the GSS data

Chapter 11: Changing Hearts and Minds

Run the code on Colab

Chapter 12: Chasing the Overton Window

Run the code on Colab

Too many bronze medals?

August 18, 2024 AllenDowney

In a recent video, Hank Green nerd-sniped me by asking a question I couldn’t not answer.

At one point in the video, he shows “a graph of the last 20 years of Olympic games showing the gold, silver, and bronze medals from continental Europe.” And it “shows continental Europe having significantly more bronze medals than gold medals.”

Hank wonders why and offers a few possible explanations, finally settling on the one I think is correct:

… the increased numbers of athletes who come from European countries weight them more toward bronze, which might actually be a more randomized medal. Placing gold might just be a better judge of who is first, because gold medal winners are more likely to be truer outliers, while bronze medal recipients are closer to the middle of the pack. And so randomness might play a bigger role, which would mean that having a larger number of athletes gives you more bronze medal winners and more athletes is what you get when you lump a bunch of countries together.

In the following notebook, I use a simple simulation to show that this explanation is plausible. Click here to run the notebook on Colab. Or read the details below.


So Many Bronze¶

In the following simulations, I show that this explanation is plausible. If you like this kind of analysis, you might like my book, Probably Overthinking It.

Click here to run this notebook on Colab.

In [1]:

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

In [2]:

from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download(
    "https://github.com/AllenDowney/ProbablyOverthinkingIt/raw/book/examples/utils.py"
)

In [3]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
plt.rcParams["figure.dpi"] = 75
plt.rcParams["figure.figsize"] = [6, 3.5]

Simulate the Olympics¶

The following function takes a random distribution, generates a population of athletes with random abilities, and returns the top three.

In [4]:

def generate(dist, n, label):
    """Generate the top 3 athletes from a country with population n.
    
    dist: distribution of ability
    n: population
    label: name of country
    """
    # generate a sample with the given size
    sample = dist.rvs(n)
    
    # select the top 3
    top3 = np.sort(sample)[-3:]
    
    # put the results in a DataFrame with country labels
    df = pd.DataFrame(dict(ability=top3))
    df['label'] = label
    return df

Here’s an example based on a normal distribution with mean 500 and standard deviation 100.

In [5]:

from scipy.stats import norm

dist = norm(500, 100)
generate(dist, 300, 'Example')

Out[5]:

	ability	label
0	746.324211	Example
1	772.016917	Example
2	885.273149	Example

Now let’s simulate the trials in two regions:

A single large country called “UnaGrandia”, with population of 30,000 athletes,
And a group of ten smaller countries called “MultiParvia” with 3,000 athletes each

In [6]:

def run_trials(dist):
    """Simulate the trials.
    
    dist: distribution of ability
    """
    # generate athletes from 10 countries with population 3,000 each
    dfs = [generate(dist, 3000, 'MultiParvia') for i in range(10)]
    
    # add in athletes from one country with population 30,000
    dfs.append(generate(dist, 30000, 'UnaGrandia'))
    
    # combine into a single DataFrame
    athletes = pd.concat(dfs)
    return athletes

The result is 33 athletes, 3 from UnaGrandia and 30 from the various countries of MultiParvia.

In [7]:

athletes = run_trials(dist)
athletes['label'].value_counts()

Out[7]:

label
MultiParvia    30
UnaGrandia      3
Name: count, dtype: int64

Here’s what the distribution of ability looks like.

In [8]:

from empiricaldist import Surv
from utils import decorate

surv_ability = Surv.from_seq(athletes['ability'], normalize=False)
surv_ability.plot(style='o', alpha=0.6, label='')
decorate(xlabel='Ability', ylabel='Rank', title='Distribution of ability')

Because we’ve selected the largest values from the distribution of ability, the result is skewed to the right — that is, there are a few extreme outliers who have the best chances of winning, and a middle of the pack that have fewer chances (with a reminder that it’s a pretty elite pack to be in the middle of).

Now let’s simulate the competition. The following function takes the distribution of ability and an additional parameter, std, that controls the randomness of the results.

When std is 0, the outcome of the competition depends only on the abilities of the athletes — the athlete with the highest ability wins every time.
As std increases, the outcome is more random, so an athlete with a lower ability has a better chance of beating an athlete with higher ability.
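To make the effect of std concrete, here is a minimal sketch of the chance that the weaker of two athletes comes out ahead. It assumes each score is ability plus an independent normal random factor with standard deviation std. The helper name prob_upset and the 10-point ability gap are hypothetical, chosen only for illustration.

```python
from math import erf, sqrt


def prob_upset(gap, std):
    """Probability that the weaker of two athletes gets the higher score.

    gap: ability of the stronger athlete minus ability of the weaker one
    std: standard deviation of the random factor added to each ability
    """
    if std == 0:
        # no randomness: the stronger athlete always wins (a tie is a coin flip)
        return 0.0 if gap > 0 else 0.5

    # The difference of two independent N(0, std) draws is N(0, std * sqrt(2)),
    # so an upset happens when that difference exceeds the gap.
    z = -gap / (std * sqrt(2))

    # standard normal CDF evaluated at z
    return 0.5 * (1 + erf(z / sqrt(2)))


for std in [0, 10, 20, 30]:
    print(std, round(prob_upset(10, std), 3))
```

The probability starts at 0 and climbs toward 1/2 as std grows, so closely matched athletes trade places more and more often.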

In [9]:

medals = ['Gold', 'Silver', 'Bronze']

def compete(dist, std=0):
    """Simulate a competition.
    
    dist: distribution of ability
    std: standard deviation of randomness
    """
    # run the trials
    athletes = run_trials(dist)
    
    # add a random factor to ability to get scores
    randomness = norm(0, std).rvs(len(athletes))
    athletes['score'] = athletes['ability'] + randomness
    
    # select and return the athletes with the top 3 scores
    podium = athletes.nlargest(3, columns='score')
    podium['medal'] = medals
    return podium

The result shows the abilities of each winner, which region they are from, their score in the competition, and the medal they won.

In [10]:

compete(dist, std=10)

Out[10]:

	ability	label	score	medal
2	920.202590	UnaGrandia	926.182143	Gold
0	876.618008	UnaGrandia	884.973475	Silver
1	876.623360	UnaGrandia	877.887775	Bronze

Now let’s simulate multiple events. The following function takes the distribution of ability again, along with the number of events and the amount of randomness in the outcomes.

In [11]:

def games(dist, num_events, std=0):
    """Simulate multiple games.
    
    dist: distribution of abilities
    num_events: how many events are contested
    std: standard deviation of randomness
    """
    dfs = [compete(dist, std) for i in range(num_events)]
    results = pd.concat(dfs)
    xtab = pd.crosstab(results['label'], results['medal'])
    return xtab[medals]

The result is a table that shows the number of each kind of medal won by each region.

In [12]:

table = games(dist, 100, std=20)
table

Out[12]:

medal	Gold	Silver	Bronze
label
MultiParvia	53	49	69
UnaGrandia	47	51	31

The following function plots the results.

In [13]:

colors = ['#FFD700', '#C0C0C0', '#CD7F32']

def plot_results(tables):
    plt.figure(figsize=(10, 3))

    for i, table in enumerate(tables):
        ax = plt.subplot(1, 6, i+1)
        plt.axhline(y=500, ls='--', color='gray', alpha=0.4)
        table.loc['MultiParvia'].plot.bar()
        plt.xticks(rotation=45)
        for bar, color in zip(ax.patches, colors):
            bar.set_color(color)
            
        if i>0:
            plt.yticks([])
        xlabel = 2004 + 4*i
        decorate(xlabel=xlabel, ylim=[0, 700], legend=False)
        
        for spine in ax.spines.values():
            spine.set_visible(False)

If there is no randomness in the outcomes, each region wins about half of the medals, and there is no excess of bronze medals for MultiParvia.

In [14]:

tables = [games(dist, 1000, std=0) for i in range(6)]

plot_results(tables)
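One way to see why the split is even: with no randomness, the podium is just the three highest abilities overall, and each of those is equally likely to come from either region because the regions have the same total population. The sketch below checks this with much smaller populations than the notebook uses so that it runs quickly. The function name and sizes are assumptions for illustration, and the small countries are pooled because the trials don't change who is in the overall top three.

```python
import numpy as np

rng = np.random.default_rng(1)


def medal_shares_no_noise(num_events=2000, n_small=30, num_small=10):
    """Share of gold, silver, bronze won by the small countries when std=0."""
    n_group = n_small * num_small   # all small countries pooled
    n_big = n_group                 # one big country with the same total size

    # True for athletes from the small countries, False for the big country
    from_small = np.arange(n_group + n_big) < n_group

    wins = np.zeros(3)
    for _ in range(num_events):
        ability = rng.normal(500, 100, size=n_group + n_big)
        # indices of the gold, silver, and bronze medalists
        podium = np.argsort(ability)[::-1][:3]
        wins += from_small[podium]
    return wins / num_events


shares = medal_shares_no_noise()
print(shares.round(2))
```

All three shares should land close to 0.5, with no extra bronze for the group of small countries.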

However, if we add enough randomness that the outcome is not certain, we see a pattern that resembles the actual data:

The two regions get about the same number of gold medals.
MultiParvia gets more bronze medals, and possibly more silver medals, too.

In [15]:

tables = [games(dist, 1000, std=20) for i in range(6)]

plot_results(tables)

The results here are more consistent than what we see in the real data because we simulated 1000 events.
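As a rough check on how much the number of events matters, a region's count of one kind of medal over independent events is binomial, so the spread of its share shrinks with the square root of the number of events. In the sketch below, the figure of 300 events for a real games is a hypothetical round number, not something taken from the data.

```python
from math import sqrt


def relative_spread(num_events, p=0.5):
    """Standard deviation of a region's share of one medal.

    Assumes events are independent and the region wins each with probability p.
    """
    return sqrt(p * (1 - p) / num_events)


for n in [300, 1000]:
    print(n, round(relative_spread(n), 3))
```

Under these assumptions the share varies by about 1.6 percentage points with 1000 events and about 2.9 with 300, so the real data should look noisier than the simulation.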

If we increase the amount of randomness, the advantage of sending more athletes to the games is even stronger — and it looks like it has an effect on the number of gold medals as well.

In [16]:

tables = [games(dist, 1000, std=30) for i in range(6)]

plot_results(tables)

Lognormal distribution of ability¶

I was curious to know how the distribution of ability affects the result, so I tried the simulations with a lognormal distribution, too. This choice might be more realistic because the distribution of ability in many fields follows a lognormal distribution (see Chapter 4 of Probably Overthinking It or this article).

Here’s a lognormal distribution that’s a good match for the distribution of Elo scores in chess.

In [17]:

from scipy.stats import lognorm

m, s = 7.08959557, 0.32758329
dist2 = lognorm(s=s, scale=np.exp(m))
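In SciPy's parameterization, s is the standard deviation of the log of the variable and scale is the exponential of its mean. As a quick sanity check (a sketch, not part of the original notebook), the median should be exp(m) and the mean exp(m + s**2 / 2):

```python
import numpy as np
from scipy.stats import lognorm

m, s = 7.08959557, 0.32758329
dist2 = lognorm(s=s, scale=np.exp(m))

# median of a lognormal is exp(m); mean is exp(m + s**2 / 2)
median = dist2.median()
mean = dist2.mean()

print(round(median), round(mean))
print(np.isclose(np.log(median), m), np.isclose(np.log(mean), m + s**2 / 2))
```

The median comes out near 1200, which is the right order of magnitude for a chess rating.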

Here’s what it looks like.

In [21]:

low, high = 200, 3200
qs = np.linspace(low, high)
ps = dist2.cdf(qs) * 100
plt.plot(qs, ps)

decorate(xlabel="Ability", ylabel="Percentile rank")

If we run the competition with this distribution, we can see that the scale of abilities is higher.

In [19]:

compete(dist2, std=100)

Out[19]:

	ability	label	score	medal
0	4329.190762	UnaGrandia	4471.560302	Gold
2	4579.134646	UnaGrandia	4471.322015	Silver
2	4267.159885	MultiParvia	4272.824660	Bronze

And here’s what the results look like.

In [20]:

tables = [games(dist2, 1000, std=200) for i in range(6)]
plot_results(tables)

They are similar to the results with a normal distribution of abilities, so it seems like the shape of the distribution is not an essential reason for the excess of bronze medals.

Conclusions¶

I think Hank is right. If you have two regions with the same population, and one is allowed to send more athletes to the games, it is not much more likely to win gold medals, but notably more likely to win silver and bronze medals — and the size of the excess depends on how much randomness there is in the outcome of the events.

If you like this kind of analysis, you might like my book, Probably Overthinking It.

Bonus¶

One more look at the data — it didn’t really pan out.

In [36]:

dfs = [compete(dist, std=20) for i in range(2000)]
results = pd.concat(dfs)

In [37]:

from empiricaldist import Cdf

results['diff'] = results['score'] - results['ability']
for name, group in results.groupby('medal'):
    cdf = Cdf.from_seq(group['diff']) * 100
    cdf.plot(label=name)
    
decorate(xlabel='Under / over performance', ylabel='Percentile rank')

The code in this notebook and utils.py is under the MIT license.

Probably Overthinking It

Data science, Bayesian Statistics, and other ideas
