In Graphs About Religion, Ryan Burge recently wrote about changing opinions about assisted suicide and how they relate to religion.
As always, when I see survey responses changing over time, I wonder whether it is driven primarily by period or cohort effects. And if you’ve read my last few posts, you know I’ve been working on a Bayesian model to answer that question.
Ryan’s analysis is based on four questions from the General Social Survey (GSS):
Do you think a person has the right to end his or her own life if this person:
Has an incurable disease? (suicide1)
Has gone bankrupt? (suicide2)
Has dishonored his or her family? (suicide3)
Is tired of living and ready to die? (suicide4)
In addition, we’ll look at results from a related question (letdie1):
When a person has a disease that cannot be cured, do you think doctors should be allowed by law to end the patient’s life by some painless means if the patient and his family request it?
The framing of the questions is different: the first four are about the right to end one’s life and the last is about the legality of doctor-assisted suicide.
Before we look at the breakdown of period and cohort effects, here are the results from a model that estimates latent opposition to each proposition as a smooth function over time.
Opposition to suicide is high in three of the scenarios — bankrupt, dishonored family, and tired of living — and lower in the incurable disease scenarios.
In all five questions, opposition has declined over time, although for the incurable disease scenarios, it might have leveled off after 1990.
Doctor-assisted death
Now let’s see if we can decompose these changes into period and cohort effects. We’ll start with the question about doctor-assisted death when the patient has an incurable disease.
As in the previous posts, I used a Bayesian model to estimate a trajectory over time for each birth cohort, shown in the following figure.
Reading from top to bottom, we can see that opposition has declined from one cohort to the next, and reading from left to right, we can see that opposition has varied over time within each cohort.
The following figure shows the cohort component alone, standardized to factor out the period effect.
Opposition to doctor-assisted suicide has declined from more than 40% in the earliest cohorts to 20% among people born in 2006.
A possible explanation for the cohort pattern is that people anchor their moral judgments to the legal environment they encounter when they are young. During the “impressionable years” of late adolescence and early adulthood, existing laws can establish a moral baseline, so that what is illegal is inferred to be wrong, and therefore should remain illegal. As a result, gradual legalization can generate long-run attitudinal change through cohort replacement: people who grow up after a practice becomes legal are less likely to see it as morally problematic.
The following figure shows the period effect alone, along with the results from the time model (which includes both period and cohort effects).
Comparing the two lines, we can conclude that the decline we see over time is entirely due to the cohort effect — when we control for generational replacement, the estimated period effect has generally increased since 1990.
The increase between 1990 and 2005 might reflect increasing moral concern due to advances in life-sustaining medical technology, high-profile legal disputes like the Terri Schiavo case, and broader discussions of the sanctity of life.
The decline between 2005 and 2015 might reflect normalization of assisted dying following legalization in several states (Oregon in 1997, Washington in 2008, Montana in 2009, and Vermont in 2013), along with a shift in public discourse toward autonomy, dignity, and patient choice, reinforced by high-profile cases like Brittany Maynard.
Other Scenarios
The following figure shows the estimated cohort effects for all five questions.
For the incurable disease scenario, opposition has declined from more than 60% in the earliest cohorts to less than 40% among cohorts born after 1950 — although it might have leveled off since then.
In the other scenarios, opposition has also declined from one cohort to the next, but the size of the effect is smaller.
The following figure shows the estimated period effects, controlling for generational replacement.
Since 1990, most of the period effects are small. The only exception is the “tired of living” scenario, where there is some decline over time, independent of generational replacement.
In the next post, we’ll do the same analysis with questions about abortion and the situations where it should be legal or not.
In a previous article, I claimed that Young adults are not very happy. Now the World Happiness Report 2026 has confirmed that young people in North America and Western Europe are less happy than they were fifteen years ago, and less happy than previous generations.
In this article, we’ll look at results from three related questions in the General Social Survey (GSS):
Trust: “Generally speaking, would you say that most people can be trusted or that you can’t be too careful in dealing with people?”
Fair: “Do you think most people would try to take advantage of you if they got a chance, or would they try to be fair?”
Helpful: “Would you say that most of the time people try to be helpful, or that they are mostly just looking out for themselves?”
As we’ll see, young adults in the United States have a more negative outlook than previous generations: they are less likely to say that people can be trusted, that they are fair, or that they are helpful. And we’ll consider connections between this bleak outlook and unhappiness.
Trust
Using the same model from the previous articles, I estimated the percentage who say people can be trusted, following each birth year over time.
Cohort trajectories, percent saying most people can be trusted
With these trajectories, we can decompose the cohort and period effects. The following figure shows the cohort effect, standardized by holding the period effect constant.
Standardized cohort effect with fixed time mix, percent saying most people can be trusted
The level of trust increased from the cohorts born in the 1900s through those born in the 1940s, and then started a steep decline. This is a large cohort effect, dropping about 30 percentage points over 60 years.
The following figure shows the period effect, standardized by holding the cohort mix constant.
Standardized time trend with fixed cohort mix, percent saying most people can be trusted
In contrast, there is almost no period effect.
The conjecture part
About my previous article, one of my former colleagues said he appreciated my attempt to offer explanations, but reminded me that with this kind of data alone, it is hard to say what causes what with any confidence. That’s true, and it’s a good reminder — but we can get some clues:
When we see a strong cohort effect and almost no period effect, that’s evidence that we’re seeing patterns set in childhood.
When we see period effects, we should look for events that affected all cohorts at the same time.
So let’s think about what was happening in the formative years of these cohorts, starting with the 1940 cohort, which was the high point in trust, before the decline:
Cohort 1940 (childhood: 1940–1960): dense local communities, strong civic and religious institutions, frequent face-to-face interaction, and shared media environment.
Cohort 1950 (1950–1970): suburbanization expands, some weakening of community density, television becomes widespread but still shared.
Cohort 1960 (1960–1980): civil rights conflict, Vietnam War, Watergate scandal, rising crime.
Cohort 1970 (1970–1990): reduced civic participation, rising inequality, more cautious parenting, less unstructured social interaction.
Cohort 1980 (1980–2000): increasing inequality, more segregation by class and education, early internet exposure, continued decline in shared institutions.
At this point a multi-generational effect comes into play — the parents of Cohort 1980, born in the 1950s and 1960s, were less trusting than previous generations of parents.
Cohort 1990 (1990–2010): widespread internet use, early social media, more structured childhood, increasing awareness of global risks.
Cohort 2000 (2000–2020): smartphones and social media throughout formative years, algorithmic content, reduced in-person interaction.
If trust is largely set early in life, then differences between cohorts reflect the environments they experienced during their first two decades.
In addition to this question about trust, the GSS includes related questions about fairness and mutual assistance.
Fair
Do you think most people would try to take advantage of you if they got a chance, or would they try to be fair? The following figure shows the percentage who thought people would be fair.
Cohort trajectories, percent saying people would try to be fair
And here’s the cohort effect.
Standardized cohort effect with fixed time mix, percent saying people would try to be fair
And the period effect.
Standardized time trend with fixed cohort mix, percent saying people would try to be fair
The cohort pattern is similar to what we saw in trust: small changes between the 1900s and 1940s cohorts, and then a steep decline — almost 40 percentage points over 60 years.
The period effect is relatively small, varying by only 10 percentage points from lowest to highest point, but it was generally positive until about 2015 (the onset of the Trump Era?).
Helpful
Would you say that most of the time people try to be helpful, or that they are mostly just looking out for themselves?
Here is a period–cohort fingerprint of the responses, showing the percentage who thought people try to be helpful.
Cohort trajectories, percent saying people try to be helpful
Here’s the cohort effect:
Standardized cohort effect with fixed time mix, percent saying people try to be helpful
And the period effect.
Standardized time trend with fixed cohort mix, percent saying people try to be helpful
Again we see the same pattern: little change between the cohorts born between 1900 and 1940, and then a decline of more than 30 percentage points over 60 years.
And again, the period effect is comparatively small and generally increasing — but possibly declining in the most recent cycles of the survey.
Cause and Effect?
It is plausible that the decline in trust is a contributing factor to the decline in happiness. If you believe that people are out to get you, and 80% of your friends agree, that's not a worldview conducive to a sense of well-being. And the generational decline in trust precedes the decline in happiness, so it is at least a potential cause.
The decline in trust-related beliefs also supports the interpretation that recent cohorts are actually unhappy, rather than that they interpret the question differently or are simply more willing than previous generations to say they are unhappy.
I haven’t done full-on causal modeling to quantify these relationships, but I ran a few regression models to explore. To reduce the number of researcher degrees of freedom, I asked ChatGPT to interpret the results:
Differences in happiness across cohorts appear to be partly explained by differences in social outlook (trust, fairness, helpfulness), and these outlook variables behave like stable, cohort-structured traits rather than period-driven fluctuations.
The AI-generated summary of the experiments follows.
Model 1: Cross-sectional association (complete cases)
Specification:
Outcome: very_happy (binary)
Predictors: trust, fair, helpful (all binary)
Sample: complete cases with all variables observed
Purpose:
Estimate the cross-sectional relationship between social outlook and happiness.
Provides baseline associations without accounting for cohort or period effects.
Interpretation:
Coefficients represent conditional associations among individuals at a point in time.
Answers: Are people with a more positive outlook more likely to be very happy?
Model 2: Outlook + cohort + period (restricted sample)
Specification:
Outcome: very_happy
Predictors:
trust, fair, helpful
cohort_c (mean-centered birth year)
year_c (mean-centered survey year)
Sample: respondents born ≥ 1940 with complete data
Purpose:
Assess whether the outlook–happiness relationship persists after accounting for:
Cohort effects (differences across birth cohorts)
Period effects (changes over survey years)
Interpretation:
Coefficients for outlook variables reflect within-cohort, within-period associations.
Cohort and year coefficients capture linear trends in happiness after controlling for outlook.
Answers:
Are outlook variables still associated with happiness after adjusting for historical context?
Is there an independent cohort or period trend?
Model 3: Cohort + period only (no outlook variables)
Specification:
Outcome: very_happy
Predictors:
cohort_c
year_c
Sample: respondents born > 1940 (larger sample since outlook variables not required)
Purpose:
Estimate total cohort and period effects on happiness without controlling for outlook.
Provides a baseline for comparison with Model 2.
Interpretation:
Cohort and year coefficients reflect combined (direct + indirect) effects.
Comparing to Model 2 shows how much of these effects are accounted for by outlook variables.
Answers:
How does happiness vary across cohorts and over time in aggregate?
How much do these patterns change when outlook is included?
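To make these specifications concrete, here is a sketch of how such models might be fit with statsmodels. The DataFrame df, the cohort column used for filtering, and other details are my assumptions about how the GSS extract is organized, not the actual analysis code.

import numpy as np
import statsmodels.formula.api as smf

# df: hypothetical DataFrame of GSS respondents with 0/1 columns very_happy,
# trust, fair, helpful; mean-centered cohort_c and year_c; and cohort = birth year

# Model 1: cross-sectional association, complete cases
m1 = smf.logit('very_happy ~ trust + fair + helpful', data=df).fit()

# Model 2: outlook plus linear cohort and period terms, cohorts born 1940 or later
m2 = smf.logit('very_happy ~ trust + fair + helpful + cohort_c + year_c',
               data=df.query('cohort >= 1940')).fit()

# Model 3: cohort and period only (outlook variables not required)
m3 = smf.logit('very_happy ~ cohort_c + year_c',
               data=df.query('cohort > 1940')).fit()

# Odds ratios for the outlook variables in Model 1
print(np.exp(m1.params))

Exponentiating the logit coefficients, as in the last line, gives odds ratios like those reported under Key Findings.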
Key Findings
Positive social outlook is associated with higher happiness.
Trust, fairness, and helpfulness all have positive and statistically significant associations with being “very happy.”
Estimated odds ratios:
Trust: ~1.25
Fairness: ~1.36 (strongest)
Helpfulness: ~1.29
These effects are modest in size and explain a small fraction of overall variation (Pseudo R² ≈ 0.016).
These relationships are stable across cohorts and time.
Adding cohort and survey year controls has little effect on the coefficients.
This suggests the outlook–happiness relationship is primarily cross-sectional, not driven by historical shifts.
Cohort and Period Effects
Without controlling for outlook:
Later cohorts are less likely to report being very happy.
There is also a negative period trend (declining happiness over time).
With outlook variables included:
The cohort effect becomes small and statistically insignificant.
The period effect remains negative and significant.
Interpretation
Outlook variables appear to mediate cohort differences in happiness.
Later cohorts tend to report lower trust, fairness, and helpfulness.
These differences account for much of the observed cohort decline in happiness.
Period effects persist independently.
There is a modest downward trend in happiness over time that is not explained by outlook variables.
Data Considerations
Approximately 40% of observations are missing at least one outlook variable, reducing the complete-case sample.
This raises the possibility of selection bias in the estimates.
Bottom Line
A more positive view of others (trust, fairness, helpfulness) is consistently associated with higher happiness.
Differences in these outlook measures help explain why later cohorts report lower happiness.
However, there is also an independent downward trend in happiness over time.
Someone asked me recently why I stopped writing about religion, and I said there were two reasons: one is that the primary dataset I was following stopped updating; the other is that Ryan Burge is doing such a good job that I felt redundant.
His most recent article presents evidence that the Nones have hit a ceiling — that is, that the percentage of people in the U.S. with no religious affiliation, which has consistently increased for several decades, has either leveled off or started to reverse.
He reports on new data from the Cooperative Election Study and the 2024 General Social Survey, including this figure based on the GSS.
The observed percentage of Nones peaked in the 2021 survey and has dropped in the last two cycles. The CES data show a similar pattern, with a much larger sample size. So I’m not going to disagree with Ryan: it sure looks like the rise of the Nones has stalled or even reversed.
However, since I am developing a model that decomposes trends like this into cohort and period effects, we can use it to check whether the turnaround is a cohort or a period effect. It turns out to be both.
The Model
The model assumes that each cohort in each year has an unobserved (latent) propensity to report a religious affiliation or none.
The cohort and period effects are modeled as second-order Gaussian random walks, which means the model assumes these effects evolve smoothly over time, unless the data provide strong evidence otherwise. The amount of smoothing is estimated from the data.
An additional random year effect captures variation from one survey to the next that is not explained by long-term trends, like current events and topics of discussion.
The “time only” version of the model estimates a latent propensity for each cycle of the survey, so the result is a smooth curve through the raw proportions.
The “time-cohort” version estimates a latent propensity for each cohort during each cycle, so the result is a trajectory over time for each birth year.
Results
Here are the results for the time-only model, showing the posterior mean and a 94% credible interval.
The posterior mean indicates that the trend in the latent factor has probably slowed; the credible interval indicates that it might have leveled off or reversed.
And here are the trajectories for each cohort:
Starting at the bottom, we can see that cohorts born between 1900 and 1930 were not very different — fewer than 10% of them were Nones.
People born in the 1940s were increasingly non-religious, but this first wave of secularization stalled in the cohorts born in the 1950s. The second wave got started with people born in the 1960s, and continued until the 2000s cohorts, where it seems to have stalled again.
Decomposition
With these trajectories, we can decompose the cohort and period effects. The following figure shows the cohort effect, standardized by holding the period effect constant.
As we saw in the previous figure, there was a period of relatively fast change in the 1940s cohorts that stalled among people born in the 1950s and then resumed among people born in the 1960s through the 1980s (primarily Gen X).
Again, it looks like the most recent cohorts have leveled off, but with the width of the credible interval, it’s possible that the trend has continued or reversed.
The following figure shows the period effect, standardized by holding the cohort mix constant.
The period effect was generally increasing from 1990 to 2020, but seems to have leveled off or rolled over.
So, if the rise of the Nones has stalled, at least temporarily, it seems to be a combination of a cohort effect among people born after 2000 and a period effect starting around 2020. This decomposition suggests we should look for at least two kinds of explanations:
Differences in the childhood of people born after 2000 that might make them more likely to have a religious affiliation as young adults, and
Events since 2020 that have affected all cohorts in ways that might make them more religious.
I’ll hold off on speculating.
For purposes of comparison, here is the trend from the time-only model (blue) and the standardized time trend from the time-cohort model (purple).
The difference between these lines is the part of the change due to the cohort effect. So we can see that most of the change over this interval is due to generational replacement rather than disaffiliation.
The Chinese edition of Probably Overthinking It is available now (also here)!
If you have the Chinese edition, there are two sections you won’t get to read — so I am including them here.
Here is an excerpt from Chapter 3, including the deleted paragraph:
In the Present
The women surveyed in 1990 rejected the childbearing example of their mothers emphatically. On average, each woman had 2.3 fewer children than her mother. If that pattern had continued for another generation, the average family size in 2018 would have been about 0.8. But it wasn’t.
In fact, the average family size in 2018 was very close to 2, just as in 1990. So how did that happen?
As it turns out, this is close to what we would expect if every woman had one child fewer than her mother. The following figure shows the actual distribution in 2018, compared to the result if we start with the 1990 distribution and simulate the “one child fewer” scenario.
The means of the two distributions are almost the same, but the shapes are different. In reality, there were more zero- and two-child families in 2018 than the simulation predicts, and fewer one-child families. But at least on average, it seems like women in the U.S. have been following the “one child fewer” policy for the last 30 years.
The scenario at the beginning of this chapter is meant to be light-hearted, but in reality governments in many places and times have enacted policies meant to control family sizes and population growth. Most famously, China implemented a one-child policy in 1980 that imposed severe penalties on families with more than one child. Of course, this policy is objectionable to anyone who considers reproductive freedom a fundamental human right. But even as a practical matter, the unintended consequences were profound.
Rather than catalog them, I will mention one that is particularly ironic: while this policy was in effect, economic and social forces reduced the average desired family size so much that, when the policy was relaxed in 2015 and again in 2021, average lifetime fertility increased to only 1.3, far below the level needed to keep the population constant, near 2.1. Since then, China has implemented new policies intended to increase family sizes, but it is not clear whether they will have much effect. Demographers predict that by the time you read this, the population of China will probably be shrinking [UPDATE: It is.]. The consequences of the one-child policy are widespread and will affect China and the rest of the world for a long time.
And here is an excerpt from Chapter 5, including the deleted explanation.
Child mortality
Fortunately, child mortality has decreased since 1900. The following figure shows the percentage of children who die before age 5 for four geographical regions, from 1900 to 2019. These data were combined from several sources by Gapminder, a foundation based in Sweden that “promotes sustainable global development […] by increased use and understanding of statistics.”
In every region, child mortality has decreased consistently and substantially. The only exceptions are indicated by the vertical lines: the 1918 influenza pandemic, which visibly affected Asia, the Americas, and Europe; World War II in Europe (1939-1945); and the Great Leap Forward in China (1958-1962). In every case, these exceptions did not affect the long-term trend.
[COMMENT: I thought I was being diplomatic by referring generally to the Great Leap Forward — rather than the Great Chinese Famine or “Three Years of Great Famine” (三年大饥荒) — but apparently that was not enough.]
Although there is more work to do, especially in Africa, child mortality is substantially lower now, in every region of the world, than in 1900. As a result, most people now are better new than used.
To demonstrate this change, I collected recent mortality data from the Global Health Observatory of the World Health Organization (WHO). For people born in 2019, we don’t know what their future lifetimes will be, but we can estimate them if we assume that the mortality rate in each age group will not change over their lifetimes.
Based on that simplification, the following figure shows average remaining lifetime as a function of age for Sweden and Nigeria in 2019, compared to Sweden in 1905.
Since 1905, Sweden has continued to make progress; life expectancy at every age is higher in 2019 than in 1905. And Swedes now have the new-better-than-used property. Their life expectancy at birth is about 82 years, and it declines consistently over their lives, just like a light bulb.
Unfortunately, Nigeria has one of the highest rates of child mortality in the world: in 2019, almost 8% of babies died in their first year of life. After that, they are briefly better used than new: life expectancy at birth is about 62 years; however, a baby who survives the first year will live another 65 years, on average.
Going forward, I hope we continue to reduce child mortality in every region; if we do, soon every person born will be better new than used. Or maybe we can do even better than that.
Since 1972, the General Social Survey has asked respondents: “Taken all together, how would you say things are these days—would you say that you are very happy, pretty happy, or not too happy?”
The following figure shows how the responses have changed over time and between birth cohorts. Each line represents one birth year.
People born in 1900 were 72 years old when the survey started; at that point, about 37% said they were very happy. In 1990, the last year they were eligible to participate, a little more than 40% said they were very happy. So it seems like they aged well—or possibly the less happy died earlier.
People born in 1910 were a little less happy when the survey started, but by the time they aged out, they also reached 40%. They were the last generation to reach that mark.
Among people born between 1920 and 1950, each cohort was a little less happy than the one before (or maybe less likely to say they were happy). In these cohorts, we can see a general trend over time: increasing until about 2000, leveling off, and declining after 2010.
The cohorts born in the 1960s and 1970s followed a similar trajectory, with only small differences from one birth year to the next.
And then the bottom fell out. Starting with people born in the 1980s (the earliest Millennials), each successive cohort was substantially less happy than the one before.
When people born in 1990 joined the survey in 2008 (at age 18), only 27% said they were very happy. In the most recent data, from 2024, the number had fallen to 22%.
When people born in 2000 entered in 2018, they set a new record low at 21%, which has now fallen to 18%.
And in the most recent cohort—born in 2006 and interviewed in 2024—only 16% said they were very happy.
These percentages are based on a statistical model that estimates the proportion of “very happy” responses in each group at each point in time. The details of the model and its assumptions are below.
The Time Trend
With an estimated proportion for each cohort and time step, we can compute separate contributions for changes over time and between cohorts.
To characterize the contribution of time, we have to hold the cohort effect constant, which we can do by computing the distribution of birth years across the entire dataset and simulating a population where this distribution does not change over time. The following figure shows the result.
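Concretely, holding the cohort mix fixed amounts to taking a weighted average of the cohort-specific estimates with weights that do not change over time. Here is a minimal sketch of that step; p_hat and counts are hypothetical names for the model's estimates and the pooled cohort counts, not the actual code.

import pandas as pd

def standardized_period_effect(p_hat: pd.DataFrame, counts: pd.Series) -> pd.Series:
    """Average cohort-specific estimates using a fixed cohort mix.

    p_hat: estimated P(very happy), rows = survey years, columns = birth cohorts
    counts: number of respondents per cohort, pooled over all survey years
    """
    weights = counts / counts.sum()          # fixed cohort distribution
    return p_hat.mul(weights, axis=1).sum(axis=1)

The same idea with rows and columns swapped, holding the distribution of survey years fixed, isolates the cohort effect described below.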
The overall level of happiness increased between 1972 and 2000, leveled off, and then declined after 2010.
Of course it is speculation to say why that happened, but we can think about large-scale economic and social patterns and how they line up with these trends.
Economically, 1980 to 2000 was a period of growth and relative stability. That changed after the end of the Dot-com bubble in 2001 and, more importantly, the Global Financial Crisis in 2008, which had broad and persistent effects on employment, wealth, and economic security.
Geopolitically, the 1970s through the 1990s were relatively quiet compared to what followed. The September 11 attacks in 2001, and the wars in Iraq (2003–2011) and Afghanistan (2001–2021) marked a shift toward a more uncertain and conflict-oriented global environment.
Participation in civic organizations and religious institutions declined over the past several decades. These institutions traditionally provided social support, shared identity, and regular face-to-face interaction. Social isolation is strongly associated with lower well-being.
At the same time, the media environment was transformed. The rise of 24-hour news increased exposure to negative and emotionally salient events, and after 2010 the spread of smartphones and social media made that exposure continuous and personalized.
Finally, measures of trust in institutions and other people have generally declined over this period, while political polarization has increased. These trends may reduce people’s sense of stability and shared purpose.
The COVID-19 pandemic likely contributed to the most recent decline, but the downward trend was already underway before 2020.
The Cohort Effect
Just as we isolated the time trend by simulating a survey with a fixed distribution of cohorts, we can isolate the cohort effect by simulating a survey with a fixed distribution of times. The following figure shows the result.
The cohort effect is larger and more consistent than the time trend: the difference between the happiest and least happy cohorts is more than 20 percentage points.
The decline was relatively slow for cohorts born between 1900 and 1950 and nearly zero for cohorts born in the 1950s, 1960s and 1970s (late Baby Boomers and Gen X). The steep decline begins with the Millennials and continues into Gen Z.
Possible explanations for the recent decline include:
Transformation of childhood: Jonathan Haidt has described childhood in recent cohorts as “overprotected in the real world and underprotected in the online world.” Increased parental monitoring, reduced independent play, and greater time spent online may affect the development of autonomy, risk tolerance, and social skills. If these early-life experiences shape long-term outlook, they could contribute to lower self-reported happiness.
Greater and earlier exposure to media: Younger cohorts were exposed to a media landscape characterized by continuous, personalized, and often negative content. Social media platforms amplify social comparison and negative content, while displacing in-person interaction. Increased awareness of global risks—including climate change—may contribute to a more pessimistic worldview.
Differential impact of economic conditions: Recent cohorts entered the labor market during periods of economic disruption, including the aftermath of the Global Financial Crisis and more recent pandemic-related shocks. These cohorts also face higher housing costs and greater student debt. Economic insecurity during the transition to adulthood may have lasting effects on well-being.
Extension of “liminal” adulthood: Young adults are taking longer to complete education, establish careers, form long-term partnerships, and have children. This extended unsettled period may be associated with lower life satisfaction.
Norms around self-reported well-being. Younger cohorts may also be less likely to say they are “very happy,” either because of changing norms around self-presentation or greater awareness of mental health.
It’s hard to say how much of the recent decline we can attribute to these causes. But the decline is steep, and seems to be ongoing.
How the Model Works
One of the challenges with this kind of survey data is that the sample size is small for each birth year in each iteration of the survey. If we plot raw percentages over time, the result is noisy.
In Probably Overthinking It, I addressed this problem by grouping respondents into decade-of-birth cohorts and smoothing the resulting time series. That approach works, but it has drawbacks: aggregation removes detail, introduces edge effects for the earliest and latest cohorts, and requires an arbitrary choice about the level of smoothing.
The new model takes a more principled approach. Instead of smoothing the observed data, it models an unobserved (latent) propensity to report being “very happy” for each cohort in each year.
We assume that the number of “very happy” responses in each group follows a binomial distribution, where the probability of a “very happy” response depends on this latent propensity. The observed responses provide noisy information about the latent factor; the model combines information across cohorts and years to estimate it.
The latent propensity is modeled as the sum of an intercept, representing the overall level of happiness, a smooth effect of birth cohort, a smooth effect of survey year, and a year-specific random effect that captures short-term fluctuations (overdispersion).
The cohort and period effects are modeled as second-order Gaussian random walks (RW2), which means the model assumes these effects evolve smoothly over time, with a preference for gradual changes in slope rather than abrupt jumps, unless the data provide strong evidence otherwise. The amount of smoothing is not fixed in advance; it is estimated from the data.
The random year effect captures variation from one survey to the next that is not explained by long-term trends, like current events and topics of discussion.
Where we have a lot of data, the estimates track the observed proportions closely. Where data are sparse, the model borrows strength from neighboring cohorts and years, providing principled smoothing and interpolation without arbitrary grouping.
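For readers who want to see the shape of such a model in code, here is a minimal PyMC sketch along these lines. The rw2 helper, the priors, and the synthetic data are placeholders I chose for illustration; they are not the actual implementation.

import numpy as np
import pymc as pm
import pytensor.tensor as pt

# Synthetic stand-in data: in the real analysis these would be GSS counts of
# "very happy" responses (k) and respondents (n) per (birth year, survey year) cell
rng = np.random.default_rng(17)
cohorts = np.arange(1900, 2007)
years = np.arange(1972, 2025)
cohort_idx, year_idx = np.meshgrid(np.arange(len(cohorts)), np.arange(len(years)))
cohort_idx, year_idx = cohort_idx.ravel(), year_idx.ravel()
n = rng.integers(5, 40, size=len(cohort_idx))
k = rng.binomial(n, 0.3)

def rw2(name, size, sigma):
    """Second-order Gaussian random walk: a Gaussian prior on second differences,
    integrated twice to produce a smooth effect."""
    d2 = pm.Normal(f'{name}_d2', 0, sigma, shape=size - 2)
    first_diff = pt.concatenate([pt.zeros(1), pt.cumsum(d2)])
    return pm.Deterministic(name, pt.concatenate([pt.zeros(1), pt.cumsum(first_diff)]))

with pm.Model():
    intercept = pm.Normal('intercept', 0, 2)
    cohort_effect = rw2('cohort_effect', len(cohorts), pm.HalfNormal('sigma_cohort', 0.1))
    period_effect = rw2('period_effect', len(years), pm.HalfNormal('sigma_period', 0.1))
    year_re = pm.Normal('year_re', 0, pm.HalfNormal('sigma_year', 0.1), shape=len(years))

    # latent propensity on the logit scale
    logit_p = (intercept
               + cohort_effect[cohort_idx]
               + period_effect[year_idx]
               + year_re[year_idx])

    pm.Binomial('obs', n=n, p=pm.math.invlogit(logit_p), observed=k)
    idata = pm.sample()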
In Chapter 9 of Probably Overthinking It I wrote about Drug Recognition Experts (DREs), who are law enforcement officers trained to recognize impaired drivers.
I reviewed the research papers that were supposed to evaluate the accuracy of DREs and I summarized my impressions like this:
What I found was a collection of studies that are, across the board, deeply flawed. Every one of them features at least one methodological error so blatant it would be embarrassing at a middle school science fair.
Recently the related topic of Field Sobriety Tests (FSTs) came up in this Reddit discussion, which links to this TV news report about sober drivers who were arrested based on FST results.
The TV report refers to this 2023 paper in JAMA Psychiatry. Because it’s recent, published in a good quality journal, and called “Evaluation of Field Sobriety Tests for Identifying Drivers Under the Influence of Cannabis: A Randomized Clinical Trial”, I thought it might address the problems I found in previous research.
Unfortunately, it has the same problems:
Selection bias: It excludes as subjects people with conditions that might cause them to fail an FST while sober – but these are exactly the people most vulnerable to false positive results.
Wrong metrics: The paper focuses on the true positive and false positive rates, and neglects the predictive value of the test – which is more relevant to the policy question.
Unrealistic base rate: In the test conditions, two thirds of the participants were impaired, which is almost certainly higher than the relevant fraction in the real world.
Despite all that, the false positive rate they reported is 49%, which means that nearly half of the sober participants were wrongly classified as impaired.
Let’s look at each of these problems more closely.
False Positives
The study tested 184 participants, 121 randomly assigned to the THC group and 63 to the placebo group. The THC group smoked cannabis cigarettes containing THC; the placebo group smoked cigarettes with almost none. Each participant was evaluated by one officer, who was “blinded to treatment assignment”. The paper reports
Officers classified 98 participants (81.0%) in the THC group and 31 (49.2%) in the placebo group as FST impaired.
The following table summarizes these results as a confusion matrix:
                FST Positive    FST Negative    Total
THC Group                 98              23      121
Placebo Group             31              32       63
Total                    129              55      184
Let’s start with the most obvious problem: of 63 people in the placebo group, 31 were wrongly classified as impaired, so the false positive rate was 49%.
Although the tests “were administered by certified DRE instructors, the highest training level for impaired driving detection”, the results for sober participants were no better than a coin toss. That’s pretty bad, but in reality it’s probably worse, because of selection bias.
Selection Bias
The study recruited 261 people who met these requirements: “age 21 to 55 years, cannabis use 4 or more times in the past month, holding a valid driver’s license, and driving at least 1000 miles in the past year.”
But it excluded 62 recruits for reasons including “history of traumatic brain injury [and] significant medical conditions or psychiatric conditions”. They also excluded people with a positive urine test for nonprescription drugs or substance use disorder in the past year.
That’s a problem because people with these kinds of medical conditions are more likely to fail an FST – even if they are not actually impaired. By excluding them, the study excludes exactly the people most vulnerable to a false positive result.
A better experiment would recruit a representative sample of drivers, including people older than 55 and people with conditions that make it hard to pass a field sobriety test. The TV report highlights an example: an autistic man who was arrested for DUI because his autism-related differences were mistaken for impairment. I assume he would have been excluded from the study.
To see how much difference the selection criteria could make, suppose 20 of the excluded participants (about one third) had been assigned to the placebo group. And suppose that because of their conditions 16 of them were wrongly classified as impaired – that’s 80%, somewhat higher than the rate among included participants.
That would increase the number of false positives by 16 and the number of true negatives by 4, so the unbiased false positive rate might be 57%.
This is just a guess: it’s not clear how many were excluded specifically for medical conditions or how many of the excluded would have failed the FST. But this calculation gives us a sense of how big the bias could be.
As I wrote in Probably Overthinking It:
How can you estimate the number of false positives if you exclude from the study everyone likely to yield a false positive? You can’t.
And that brings us to the next problem.
Predictive Value
The paper reports:
Officers classified 98 participants (81.0%) in the THC group and 31 (49.2%) in the placebo group as FST impaired at the first evaluation
They quantify this difference as 31.8 percentage points (95% CI, 16.4-47.2 percentage points) and report a p-value < .001. Based on this analysis, they conclude:
FSTs administered by highly trained law enforcement officers differentiated between individuals receiving THC vs placebo
This conclusion is true in the sense that the difference in percentages is statistically significant, but the policy question is not whether THC exposure changes FST performance under laboratory conditions. The question is whether an FST result provides sufficiently strong evidence to justify detention or arrest.
For that, the false positive rate is relevant, and as we have discussed, it is probably more than 50%.
But even more important is the positive predictive value (PPV), which is the probability that a positive test is correct. In the confusion matrix, there are 129 positive tests, of which 98 are correct and 31 incorrect, so the PPV is 98 out of 129, about 76%.
Of the people who failed the FST, 76% were actually impaired. That might sound good enough for probable cause, but that conclusion is misleading because there is still another problem – the base rate.
Base Rate
In the study, two thirds of the participants were impaired. In the real world, it is unlikely that two thirds of drivers are impaired – or even two thirds of drivers who take an FST. So the base rate in the study is too high.
To see why that matters, we have to do a little math. First we’ll use the confusion matrix to compute one more metric, sensitivity, which is the percentage of impaired participants who were classified correctly.
We can use sensitivity, along with the false positive rate we already computed, to figure out the positive predictive value of a test with a more realistic base rate.
Of all people pulled over and given a field sobriety test, how many do you think are impaired by THC? That’s a hard question to answer, so we’ll try a couple of values.
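The calculation we are about to walk through is easier to see in a few lines of Python. This is a sketch of the arithmetic, using the sensitivity and false positive rate from the confusion matrix above; it is not code from the paper.

sensitivity = 98 / 121        # true positive rate in the THC group, about 81%
fpr = 31 / 63                 # false positive rate in the placebo group, about 49%

def ppv(base_rate, n_drivers=100):
    """Positive predictive value for a hypothetical group of drivers."""
    true_pos = base_rate * n_drivers * sensitivity
    false_pos = (1 - base_rate) * n_drivers * fpr
    return true_pos / (true_pos + false_pos)

for base_rate in [2/3, 1/3, 0.15]:
    print(f'base rate {base_rate:.2f}: PPV = {ppv(base_rate):.0%}')

With the study's base rate of about two thirds, this reproduces the 76% positive predictive value; the lower base rates give the values worked out below.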
First, suppose the base rate is one third, rather than the two thirds in the study. If we imagine 100 drivers:
If 33 are impaired, and sensitivity is 81%, we expect 27 true positive results.
If 67 are not impaired, and the false positive rate is 49%, we expect 33 false positive results.
In that case the positive predictive value is 27 / (27 + 33), which means that only 45% of positive tests are correct. If we put those numbers in a table, the calculation might be clearer.
              Tests    Prob pos    Pos tests    Percent
Impaired         33       0.810       26.727     44.773
Not impaired     67       0.492       32.968     55.227
With a lower base rate, PPV is lower, which means that a positive test is weaker evidence of impairment. But even 45% might be too high.
If we suppose that 15% of drivers who take an FST are impaired, we can run the numbers again.
              Tests    Prob pos    Pos tests    Percent
Impaired         15       0.810       12.149     22.508
Not impaired     85       0.492       41.825     77.492
With a 15% base rate, the predictive value of the test is only 23% – which means 77% of drivers identified as impaired would actually be sober.
In reality, the base rate depends on the context. At a checkpoint where every driver is stopped, the base rate might be lower than 15%. If a driver is stopped for driving erratically, the base rate might be relatively high. But even then, it is unlikely to be as high as 66%, as in the study.
Discussion
The JAMA Psychiatry study provides valuable data, but it suffers from the same methodological problems as previous DRE validation studies:
High false positive rate: Nearly half of sober participants were incorrectly classified as impaired.
Selection bias: The study excluded exactly the people most likely to be falsely accused, making it impossible to assess the true false positive rate in the general population.
Unrealistic base rate: The base rate in the study is higher than what we expect in real-world use, which inflates the predictive value of the test.
Although I have been critical of the study, I agree with their interpretation of the results:
…the substantial overlap of FST impairment between groups and the high frequency at which FST impairment was suspected to be due to THC suggest that absent other indicators, FSTs alone may be insufficient to identify THC-specific driving impairment.
Emphasis mine.
Notes
In my interpretation of the results, I follow the methodology of the study, which treats assignment to the THC group as ground truth – that is, we assume that participants in the THC group were actually impaired and participants in the placebo group were not. And the paper reports:
Median self-reported highness (scale of 0 to 100, with higher scores indicating more impairment) at 30 minutes was 64 (IQR, 32-76) for the THC group and 13 (IQR, 1-28) for the placebo group (P < .001).
The THC group felt that they were more impaired, but based on the IQRs, it looks like there might be overlap. That complicates the interpretation of “impaired”, but for this analysis I use the study’s operational definition.
If you have studied probability, you might be familiar with fractional odds, which represent the ratio of the probability something happens to the probability it doesn’t. For example, if the Seahawks have a 75% chance of winning the Super Bowl, they have a 25% chance of losing, so the ratio is 75 to 25, sometimes written 3:1 and pronounced “three to one”.
But if you search for “the odds that the Seahawks win”, you will probably get moneyline odds, also known as American odds. Right now, the moneyline odds are -240 for the Seahawks and +195 for the Patriots. If you are not familiar with this format, that means:
If you bet $100 on the Patriots and they win, you gain $195 – otherwise you lose $100.
If you bet $240 on the Seahawks and they win, you gain $100 – otherwise you lose $240.
If you are used to fractional odds, this format might make your head hurt. So let’s unpack it.
Suppose you think the Patriots have a 25% chance of winning. Under that assumption, we can compute the expected value of the first wager like this:
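The post uses a small helper function for this; here is a version consistent with the numbers that follow (the original definition may differ in detail).

def expected_value(p, wager, payout):
    # with probability p you gain `payout`; otherwise you lose `wager`
    return p * payout - (1 - p) * wager

expected_value(p=0.25, wager=100, payout=195)

-26.25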
If the Patriots actually have a 25% chance of winning, the first wager has negative expected value – so you probably don’t want to make it.
Now let’s compute the expected value of the second wager – assuming the Seahawks have a 75% chance of winning:
expected_value(p=0.75, wager=240, payout=100)
15.0
The expected value of this wager is positive, so you might want to make it – but only if you have good reason to think the Seahawks have a 75% chance of winning.
Implied Probability
More generally, we can compute the expected value of each wager for a range of probabilities from 0 to 1.
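The following setup is my reconstruction of the arrays used in the plot below; it assumes the expected_value helper defined above.

import numpy as np
import matplotlib.pyplot as plt

ps = np.linspace(0, 1)                                       # probability the Patriots win
ev_patriots = expected_value(ps, wager=100, payout=195)      # EV of a $100 bet on the Patriots
ev_seahawks = expected_value(1 - ps, wager=240, payout=100)  # EV of a $240 bet on the Seahawks

# `decorate`, used in the plotting code below, is a helper from the author's utilities;
# plt.xlabel, plt.ylabel, and plt.legend do the same job.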
plt.plot(ps, ev_patriots, label='Bet on Patriots')
plt.plot(ps, ev_seahawks, label='Bet on Seahawks')
plt.axhline(0, color='gray', alpha=0.4)
decorate(xlabel='Actual probability Patriots win',
ylabel='Expected value of wager')
To find the crossover point, we can set the expected value to 0 and solve for p. This function computes the result:
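Here is a version consistent with the values reported below; again, the original definition may differ in detail.

def crossover(wager, payout):
    # break-even probability: p * payout = (1 - p) * wager
    return wager / (wager + payout)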
Here’s crossover for a bet on the Patriots at the offered odds.
p1 = crossover(100, 195)
p1
0.3389830508474576
If you think the Patriots have a probability higher than the crossover, the first bet has positive expected value.
And here’s the crossover for a bet on the Seahawks.
p2 = crossover(240, 100)
p2
0.7058823529411765
If you think the Seahawks have a probability higher than this crossover, the second bet has positive expected value.
So the offered odds imply that the consensus view of the betting market is that the Patriots have a 33.9% chance of winning and the Seahawks have a 70.6% chance. But you might notice that the sum of those probabilities exceeds 1.
p1 + p2
1.0448654037886342
What does that mean?
The Take
The sum of the crossover probabilities determines “the take”, which is the share of the betting pool taken by “the house” – that is, the entity that takes the bets.
For example, suppose 1000 people take the first wager, betting $100 each on the Patriots, and another 1000 people take the second wager, betting $240 each on the Seahawks.
Here’s the total expected value of all of those wagers.
total = expected_value(ps, 100_000, 195_000) + expected_value(1-ps, 240_000, 100_000)
plt.plot(ps, total, label='Total')
plt.axhline(0, color='gray', alpha=0.4)
decorate(xlabel='Actual probability Patriots win',
ylabel='Total expected value of all wagers')
The total expected value is negative for all probabilities (or zero if the Patriots have no chance at all) – which means the house wins.
How much the house wins depends on the actual probability. As an example, suppose the actual probability is the midpoint of the probabilities implied by the odds:
p = (p1 + (1-p2)) / 2
p
0.31655034895314055
In that case, here’s the expected take, assuming that the implied probability is correct.
take = -expected_value(p, 100_000, 195_000) - expected_value(1-p, 240_000, 100_000)
take
14244.765702891316
As a percentage of the total betting pool, it’s a little more than 4%.
take / (100_000 + 240_000)
0.04189636971438623
We could have approximated this by computing the “overround”, which is the amount by which the sum of the implied probabilities exceeds 1.
(p1 + p2) - 1
0.04486540378863424
Don’t Bet
In summary, here are the reasons you should not bet on the Super Bowl:
If the implied probabilities are right (within a few percent) all wagers have negative expected value.
If you think the implied probabilities are wrong, you might be able to make a good bet – but only if you are right. The odds represent the aggregated knowledge of everyone who places a bet, which probably includes a lot of people who know more than you.
If you spend a lot of time and effort, you might find instances where the implied probabilities are wrong, and you might even make money in the long run. But there are better things you could do with your time.
Betting is a zero-sum game if you include the house and a negative-sum game for people who bet. If you make money, someone else loses – there is no net creation of economic value.
So, if you have the skills to beat the odds, find something more productive to do.
Some people have strong opinions about this question:
In a family with two children, if at least one of the children is a girl born on Tuesday, what are the chances that both children are girls?
In this article, I hope to offer
A solution to one interpretation of this question,
An explanation of why the solution seems so counterintuitive,
A discussion of other interpretations, and
An implication of this problem for teaching and learning probability.
Let’s get started.
One interpretation
One reason this problem is contentious is that it is open to multiple interpretations. I’ll start by presenting just one – then we’ll get back to the ambiguity.
First, to avoid real-world complications, let’s assume an imaginary world where:
Every family has two children.
50% of children are boys and 50% are girls.
All days of the week are equally likely birth days.
Genders and birth days are independent.
Second, we will interpret the question in terms of conditional probability; that is, we’ll compute P(B|A), where
A is “at least one of the children is a girl born on Tuesday”, and
B is “both children are girls”.
Under these assumptions and this interpretation, the answer is unambiguous – and it turns out to be 13/27 (about 48.1%).
But why?
This problem is counterintuitive because it elicits confusion between causation and evidence.
If a family has a girl born on a Tuesday, that does not cause the other child to be a girl.
But the fact that a family has a girl born on Tuesday is evidence that the other child is a girl.
To see why, imagine two families: one has one girl and the other has ten girls. Suppose I choose one of the families at random, check whether they have a girl born on Tuesday, and find that they do.
Which family do you think I chose?
If I chose the family with one girl, the chance is only 1/7 (about 14%) that she was born on Tuesday.
If I chose the family with ten girls, the chance is about 79% that at least one of them was born on a Tuesday.
And that’s the key to understanding the problem:
A family with more than one girl is more likely to have one born on Tuesday. Therefore, if a family has a girl born on a Tuesday, it is more likely that they have more than one girl.
That’s the qualitative argument. Now we’ll make it quantitative – with Bayes’s Theorem.
Bayes’s Theorem
Let’s start with four kinds of two-child families.
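Here is one way to do the computation, a Bayes-table sketch in Python using exact fractions (my reconstruction, not necessarily the original presentation).

from fractions import Fraction

# Four equally likely two-child families (birth order matters)
prior = Fraction(1, 4)

# Likelihood: P(at least one girl born on Tuesday | family type)
likelihood = {
    'GG': 1 - Fraction(6, 7)**2,   # 13/49
    'GB': Fraction(1, 7),
    'BG': Fraction(1, 7),
    'BB': Fraction(0),
}

joint = {family: prior * like for family, like in likelihood.items()}
posterior_two_girls = joint['GG'] / sum(joint.values())
print(posterior_two_girls)         # 13/27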
The posterior probability of two girls is 13/27. As always, Bayes’s Theorem is the chainsaw that cuts through the knottiest problems in probability.
Other versions
Everything so far is based on the interpretation of the question as a conditional probability. But many people have pointed out that the question is ambiguous because it does not specify how we learn that the family has a girl born on a Tuesday.
This objection is valid:
The answer depends on how we get the information, and
The statement of the problem does not say how.
There are many versions of this problem that specify different ways you might learn that a family has a girl born on a Tuesday, and you might enjoy the challenge of solving them.
In general, if we specify the process that generates the data, we can use simulation, enumeration, or Bayes’s Theorem to compute the conditional probability given the data.
But what should we do if the data-generating process is not uniquely specified?
One option is to say that the question has no answer because it is ambiguous.
Another option is to specify a prior distribution of possible data-generating processes, compute the answer under each process, and apply the law of total probability.
Some of the people who choose the second option also choose a prior distribution so that the answer turns out to be 1/2. In my view, that is a correct answer to one interpretation, but that interpretation seems arbitrary – by choosing different priors, we can make the answer almost anything.
I prefer the interpretation I presented, because
I believe it is what was intended by the people who posed the problem,
It is consistent with the conventional interpretation of conditional probability,
It yields an answer that seems paradoxical at first, so it is an interesting problem,
The apparent paradox can be resolved in a way that sheds light on conditional probability and the idea of independent events.
So I think it’s a perfectly good problem – it’s just hard to express it unambiguously in natural language (as opposed to math notation).
But you don’t have to agree with me. If you prefer a different interpretation of the question, and it leads to a different answer, feel free to write a blog post about it.
What about independence?
I think the girl born on Tuesday carries a lesson about how we teach. In introductory probability, students often learn two ways to compute the probability of a conjunction. First they learn the easy way:
P(A and B) = P(A) P(B)
But they are warned that this only applies if A and B are independent. Otherwise, they have to do it the hard way:
P(A and B) = P(A) P(B|A)
But how do we know whether A and B are independent? Formally, they are independent if
P(B|A) = P(B)
So, in order to know which formula to use, you have to know P(B|A). But if you know P(B|A), you might as well use the second formula.
Rather than check independence by computing conditional probabilities, it is more common to assert independence by intuition. For example, if we flip two coins, we have a strong intuition that the outcomes are independent. And if the coins are known to be fair, this intuition is correct. But if there is any uncertainty about the probability of heads, it is not.
The coin example, along with Monty Hall, Bertrand's Boxes, and many more, demonstrates the real lesson of the girl born on Tuesday: our intuition for independence is wildly unreliable.
Which means we might want to rethink the way we teach it.
In general
Previously I wrote about a version of this problem where the girl is named Florida. In general, if we are given that a family has at least one girl with a particular property, and the prevalence of the property is p, we can use Bayes’s Theorem to compute the probability of two girls.
I’ll use SymPy to represent the priors and the probability p.
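Here is a sketch of that computation; the variable names and structure are mine.

from sympy import Rational, symbols, simplify

p = symbols('p', positive=True)
prior = Rational(1, 4)

# P(at least one girl with the property | family type), where the property has prevalence p
likelihood = {
    'GG': 1 - (1 - p)**2,
    'GB': p,
    'BG': p,
    'BB': 0,
}

joint = {family: prior * like for family, like in likelihood.items()}
posterior_two_girls = simplify(joint['GG'] / sum(joint.values()))
posterior_two_girls    # simplifies to (p - 2)/(p - 4), equivalently (2 - p)/(4 - p)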
The following figure shows the probability of two girls as a function of the prevalence of the property.
import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(0, 1)
ys = (xs - 2) / (xs - 4)

plt.plot(xs, ys)
plt.xlabel('Prevalence of the property')
plt.ylabel('Conditional probability of two girls')
If the property is rare – like the name Florida – the conditional probability is close to 1/2. If the property is common – like having a name – the conditional probability is close to 1/3.
Objections
Here are some objections to the “girl born on Tuesday” problem along with my responses.
You have to model the message, not just the event
Objection. The statement “at least one child is a girl born on Tuesday” should not be treated as a bare event in a probability space. It should be treated as the outcome of a random process that generates messages or facts we learn. Therefore, the probability space must include not only family composition, but also the mechanism by which that information is produced. Any solution that conditions only on the family outcomes is incomplete.
Response. I agree that if the problem is interpreted as conditioning on a message (something that is said, reported, or chosen from among several true statements), then the reporting mechanism matters and must be modeled explicitly. However, I don’t think such a mechanism is required in all cases. It is standard and meaningful to interpret a question as conditioning on an event – an extensional property of outcomes – without introducing an additional random variable for how the information was obtained. That is the interpretation I adopt here.
Without a specified selection rule, symmetry forces the answer to 1/2
Objection. If the problem does not specify how the information was obtained, then we must assume a symmetric rule for selecting which true statement is revealed. Under that assumption, conditioning on “at least one boy” or “at least one girl” must give the same answer, and applying the law of total probability forces the posterior probability to equal the prior. Therefore, the correct answer must be 1/2.
Response. This conclusion follows only if we assume that the conditioning is on a message chosen from a symmetric set of alternatives. Under that interpretation, the result does depend on the selection rule, and 1/2 is a valid answer for one particular choice of rule. But if the conditioning is on an event rather than a message, there is no requirement that different events form a symmetric partition or that the law of total probability be applied across them in this way. Under the event-based interpretation, the argument forcing 1/2 does not apply.
The problem is ambiguous and therefore has no answer
Objection. Because the problem does not specify how we learn that there is a girl born on Tuesday, it is fundamentally ambiguous. Since different interpretations lead to different answers, the question has no single correct solution.
Response. It’s true that the problem is ambiguous as stated in natural language. One option is to declare it unanswerable. Another is to resolve the ambiguity by adopting a conventional default interpretation. I choose the latter: I interpret the question as a conditional probability defined on an explicit probability model and make that interpretation clear by enumerating the sample space. Under that interpretation, the answer is unambiguous and, in my view, interesting and instructive – even if other interpretations lead to different answers.
You are changing the sampling procedure
Objection. Some people object that the 13/27 result comes from changing how families are selected. Conditioning on “at least one child is a girl born on Tuesday” oversamples families with more girls, so the conditional distribution no longer represents the original population of two-child families. From this perspective, the result feels like an artifact of biased sampling rather than a genuine probability update.
Response. That description is accurate, but it is not a flaw. Conditioning is biased sampling: evidence changes the distribution of outcomes. Families with more girls really are more likely to satisfy the condition, and the conditional probability reflects that fact.
The day of the week seems irrelevant
Objection. Tuesday has nothing to do with gender, so it feels wrong that adding this detail should change the probability. Since the day of the week does not cause a child to be a girl, it seems irrelevant to the question.
Response. This objection reflects a common confusion between causal independence and evidential relevance. While the day of the week does not cause the other child’s gender, it provides evidence about the number of girls in the family. Evidence can change probabilities even when there is no causal connection.
The result depends on unrealistic independence assumptions
Objection. The solution assumes that genders and days of the week are independent and uniformly distributed, which is not true in the real world. If those assumptions are relaxed, the answer changes.
Response. That is correct, but those assumptions are not the source of the puzzle. Relaxing them changes the numerical value of the answer, but not the underlying logic. The same kind of reasoning applies under more realistic models.
The problem is artificial or pathological
Objection. Some readers reject the problem not because the calculation is wrong, but because the setup feels artificial or unlike how information is learned in real life. From this view, the problem is a trick rather than a meaningful probability question.
Response. Whether this is a flaw or a feature depends on the goal. The problem is artificial, but it is intended to expose how unreliable our intuitions about conditional probability and independence can be. In that sense, its artificiality is what makes it pedagogically useful. The underlying issue – determining how evidence bears on hypotheses – comes up in real-world problems all the time. And getting it wrong has real-world consequences.
At PyData Global 2025 I presented a workshop on Bayesian Decision Analysis with PyMC. The video is available now.
This workshop is based on the first session of the Applied Bayesian Modeling Workshop I teach along with my colleagues at PyMC Labs. If you would like to learn more, it is not too late to sign up for the next offering, starting Monday January 12.
Here’s the abstract and description of the workshop.
Bayesian Decision Analysis with PyMC: Beyond A/B Testing
This hands-on tutorial introduces practical Bayesian inference using PyMC, focusing on A/B testing, decision-making under uncertainty, and hierarchical modeling. With real-world examples, you’ll learn how to build and interpret Bayesian models, evaluate competing hypotheses, and implement adaptive strategies like Thompson sampling. Whether you’re working in marketing, healthcare, public policy, UX design, or data science more broadly, these techniques offer powerful tools for experimentation, decision-making, and evidence-based analysis.
Description
Bayesian methods offer a natural and interpretable framework for updating beliefs with data, and PyMC makes it easy to apply these techniques in practice. In this tutorial, we’ll walk through a series of examples that demonstrate the core concepts:
Bayesian A/B Testing with the Beta-Binomial Model
Represent prior beliefs with the beta distribution
Use binomial likelihoods to model observed outcomes
Understand posterior distributions and credible intervals
Bayesian Bandits and Thompson Sampling
Go beyond hypothesis testing: estimate the probability of one version outperforming another
Use Thompson sampling to guide decision-making
Simulate and visualize an adaptive email campaign
Hierarchical Models for Partial Pooling and Prediction
Learn how to share information across variants
Use posterior predictive distributions to quantify uncertainty
Understand second-order probabilities
Hands-On Learning
Participants will follow along in Jupyter notebooks (hosted on Colab — no installation required). Exercises are embedded throughout, with guided solutions. Code is based on PyMC, ArviZ, and standard scientific Python libraries.
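As a taste of the first topic, here is a minimal Beta-Binomial A/B sketch in PyMC; the counts are invented for illustration and are not from the workshop materials:

import pymc as pm

# hypothetical results for two email variants: clicks out of sends
sends = [1000, 1000]
clicks = [52, 67]

with pm.Model():
    # uniform Beta priors on each variant's click rate
    p = pm.Beta('p', alpha=1, beta=1, shape=2)
    # binomial likelihood for the observed clicks
    pm.Binomial('obs', n=sends, p=p, observed=clicks)
    idata = pm.sample()

# probability that the second variant outperforms the first
samples = idata.posterior['p'].values    # shape (chains, draws, 2)
print((samples[..., 1] > samples[..., 0]).mean())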
Prerequisites
Intermediate Python: basic familiarity with NumPy, plotting, and Jupyter notebooks
No prior experience with Bayesian statistics or PyMC is assumed
Suppose you are not sure whether all ravens are black. If you see a white raven, that clearly refutes the hypothesis. And if you see a black raven, that supports the hypothesis in the sense that it increases your confidence, maybe only slightly. But what if you see a red apple – does that make the hypothesis any more or less likely?
This question is the core of the Raven paradox, a problem in the philosophy of science posed by Carl Gustav Hempel in the 1940s. It highlights a counterintuitive aspect of how we evaluate evidence and confirm hypotheses.
No resolution of the paradox is universally accepted, but the most common is what I will call the standard Bayesian response. In this article, I’ll present this response, explain why I think it is incomplete, and propose an extension that might resolve the paradox.
Consider two hypotheses: (A) all ravens are black, and (B) all non-black things are non-ravens. Logically, these hypotheses are identical – if A is true, B must be true, and vice versa. So if we have a certain level of confidence in A, we should have exactly the same confidence in B. And if we observe evidence in favor of A, we should also accept it as evidence in favor of B, to the same degree.
Also, if we accept that a black raven is evidence in favor of A, we should also accept that a non-black non-raven is evidence in favor of B.
Finally, if a non-black non-raven is evidence in favor of B, we should also accept that it is evidence in favor of A.
Therefore, a red apple (which is a non-black non-raven) is evidence that all ravens are black.
If you accept this conclusion, it seems like every time you see a red apple (or a blue car, or a green leaf, etc.) you should think, “Now I am slightly more confident that all ravens are black”.
But that seems absurd, so we have two options:
Discover an error in the argument, or
Accept the conclusion.
As you might expect, many versions of (1) and (2) have been proposed.
The standard Bayesian response is to accept the conclusion but, quoth Wikipedia, “argue that the amount of confirmation provided is very small, due to the large discrepancy between the number of ravens and the number of non-black objects. According to this resolution, the conclusion appears paradoxical because we intuitively estimate the amount of evidence provided by the observation of a green apple to be zero, when it is in fact non-zero but extremely small.”
It is true that when the number of non-ravens is large, the amount of evidence we get from each non-black non-raven is so small it is negligible. But I don’t think that’s why the conclusion is so acutely counterintuitive.
To clarify my objection, let me present a smaller example I’ll call the Roulette paradox.
The Roulette Paradox
An American roulette wheel has 36 pockets with the numbers 1 to 36, and two pockets labeled 0 and 00. The non-zero pockets are red or black, and the zero pockets are green.
Suppose we work in quality control at the roulette factory and our job is to check that all zero pockets are green. If we observe a green zero, that’s evidence that all zeros are green. But what if we observe a red 19?
In this example, the standard Bayesian response fails:
First, the number of non-zeros is not particularly large, so the weight of the evidence is not negligible.
Also, the Bayesian response doesn’t address what I think is actually the key: The non-green non-zero may or may not be evidence, depending on how it was sampled.
As I will demonstrate,
If we choose a pocket at random and it turns out to be a non-green non-zero, that is not evidence that all zeros are green.
But if we choose a non-green pocket and it turns out to be non-zero, that is evidence that all zeros are green.
In both cases we observe a non-green non-zero, but “observe” is ambiguous. Whether the observation is evidence or not depends on the sampling process that generated the observation. And I think confusion between these two scenarios is the foundation of the paradox.
The Setup
Let’s get into the details. Switching from roulette back to ravens, we will consider four scenarios:
You choose a random thing and it turns out to be a black raven.
You choose a random thing and it turns out to be a non-black non-raven.
You choose a random raven and it turns out to be black.
You choose a random non-black thing and it turns out to be a non-raven.
The key to the raven paradox is the difference between scenarios 2 and 4.
Scenario 2 is what most people imagine when they picture “observing a red apple”. And in this scenario, the red apple is irrelevant, exactly as intuition insists.
In Scenario 4, a red apple is evidence in favor of A, because we’re systematically checking non-black things to ensure they’re not ravens – so finding they aren’t is confirmation. But this sampling process is a more contrived interpretation of “observing a red apple”.
The reason for the paradox is that we imagine Scenario 2 and we are given the conclusion from Scenario 4.
It might not be obvious why the red apple is evidence in Scenario 4, but not Scenario 2. I think it will be clearer if we do the math.
The Math
We’ll start with a small world where there are only N = 9 ravens and M = 19 non-ravens. Then we’ll see what happens as we vary N and M.
I’ll use i to represent the unknown number of black ravens, which could be any value from 0 to N, and j to represent the unknown number of black non-ravens, from 0 to M.
We’ll use a joint distribution to represent beliefs about i and j; then we’ll use Bayes’s Theorem to update these beliefs when we see new data.
Let’s start with a uniform prior over all possible combinations of (i, j). For this prior, the probability of A is 10%. We’ll see later that the prior affects the strength of the evidence, but it doesn’t affect whether an observation is in favor of A or not.
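Here is a minimal NumPy sketch of that setup; the variable and function names are mine, not necessarily the ones used to generate the figures:

import numpy as np

N, M = 9, 19    # number of ravens and non-ravens

# uniform joint prior over i (black ravens) and j (black non-ravens);
# prior[i, j] is the probability of that combination
prior = np.ones((N + 1, M + 1))
prior /= prior.sum()

def prob_A(joint):
    # probability of hypothesis A: all ravens are black, that is, i == N
    return joint[N, :].sum()

def update(joint, likelihood):
    # Bayes's Theorem: multiply by the likelihood and renormalize
    posterior = joint * likelihood
    return posterior / posterior.sum()

print(prob_A(prior))    # 0.1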
Scenario 1
Now let’s consider the first scenario: we choose a thing at random from the universe of things, and we find that it is a black raven.
The likelihood for this observation is: i / (N + M), because i is the number of black ravens and N + M is the total number of things.
In this scenario the posterior probability of A is 20%. The posterior probability is higher than the prior, so the black raven is evidence in favor of A.
To quantify the strength of the evidence, we’ll use the log odds ratio, which is 0.81. Later we’ll see how the strength of the evidence depends on the prior distribution of i and j.
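Continuing the sketch, here is the Scenario 1 update and the log odds ratio:

# grids of i and j values, same shape as the joint distribution
ii, jj = np.meshgrid(np.arange(N + 1), np.arange(M + 1), indexing='ij')

# Scenario 1: a randomly chosen thing turns out to be a black raven
like1 = ii / (N + M)
post1 = update(prior, like1)

p0, p1 = prob_A(prior), prob_A(post1)
lor = np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0))
print(p1, lor)    # 0.2 and about 0.81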
Before we go on, let’s also look at the marginal distribution of i (number of black ravens) before and after.
As expected, observing a black raven increases our confidence that all ravens are black. The posterior distribution shifts toward higher values of i, and the probability that i = N increases.
In Scenario 1, the likelihood depends only on i, not on j, so the update doesn’t change our beliefs about j (the number of black non-ravens).
Finally, let’s visualize the posterior joint distribution of i and j.
Because we started with a uniform distribution and the data has no bearing on j, the joint posterior probabilities don’t depend on j.
In summary, Scenario 1 is consistent with intuition: a black raven is evidence in favor of A.
Scenario 2
In this scenario, we choose a thing at random from the universe of N + M things, and it turns out to be a red apple – which we will treat generally as a non-black non-raven.
The likelihood of this observation is: (M - j) / (N + M), because M - j is the number of non-black non-ravens and N + M is the total number of things.
In this scenario, the posterior probability of A is the same as the prior. In fact, the entire distribution of i is unchanged.
So the red apple is not evidence in favor of A or against it. This is consistent with the intuition that the red apple (or any non-black non-raven) is irrelevant.
However, the red apple is evidence about j, as we can confirm by comparing the marginal distribution of j before and after.
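In the sketch, the Scenario 2 likelihood depends only on j, so the probability of A is unchanged, but the marginal distribution of j is not:

# Scenario 2: a randomly chosen thing turns out to be a non-black non-raven
like2 = (M - jj) / (N + M)
post2 = update(prior, like2)

print(prob_A(post2))    # 0.1, same as the prior
# the marginal distribution of j does change: compare prior.sum(axis=0) and post2.sum(axis=0)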
And here’s the posterior joint distribution of i and j.
Because the red apple has no bearing on i, the posterior probabilities in this scenario don’t depend on i.
In summary, Scenario 2 matches our intuition: a red apple (chosen at random) is not evidence about whether all ravens are black.
Scenario 3
In this scenario, we choose a raven first and then observe that it is black.
The likelihood for this observation is: i / N, because i is the number of black ravens and N is the total number of ravens.
In this scenario, the posterior probability of A is 20%, the same as in Scenario 1. So we conclude that the black raven is evidence in favor of A, with the same strength regardless of whether we are in:
Scenario 1: Select a random thing and it turns out to be a black raven or
Scenario 3: Select a random raven and it turns out to be black.
In fact, the entire posterior distribution is the same in both scenarios. That’s because the likelihoods in Scenarios 1 and 3 differ only by a constant factor, which is removed when the posterior distributions are normalized.
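In the sketch, the Scenario 3 likelihood differs from Scenario 1 only by the constant factor N / (N + M), so the posterior is identical:

# Scenario 3: a randomly chosen raven turns out to be black
like3 = ii / N
post3 = update(prior, like3)
print(prob_A(post3))    # 0.2, same as Scenario 1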
In summary, Scenario 3 is consistent with intuition: if we choose a raven and find that it is black, that is evidence in favor of A.
Scenario 4
In the last scenario, we first choose a non-black thing (from all non-black things in the universe), and then observe that it is a non-raven.
The likelihood of this observation is: (M - j) / (N - i + M - j) because M - j is the number of non-black non-ravens and N - i + M - j is the total number of non-black things.
This likelihood depends on both i and j, unlike Scenario 2, where it depends only on j. This is the key difference that makes Scenario 4 informative about whether all ravens are black.
The posterior probability of A is about 15%, which is greater than the prior, so the non-black non-raven is evidence in favor of A. The log odds ratio is about 0.46, which is smaller than in Scenarios 1 and 3, because there are more non-ravens than ravens. As we’ll see, the strength of the evidence gets smaller as M gets bigger.
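Here is the Scenario 4 update in the sketch. The one subtlety is the cell where i = N and j = M: there are no non-black things at all, so the observation is impossible there and its likelihood is set to 0:

# Scenario 4: a randomly chosen non-black thing turns out to be a non-raven
nonblack = (N - ii) + (M - jj)    # number of non-black things in each cell
like4 = np.where(nonblack > 0, (M - jj) / np.maximum(nonblack, 1), 0)
post4 = update(prior, like4)

p0, p1 = prob_A(prior), prob_A(post4)
lor = np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0))
print(p1, lor)    # about 0.149 and 0.46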
Here is the marginal distribution of i (number of black ravens) before and after.
And here’s the marginal distribution of j (number of black non-ravens) before and after.
Finally, here’s the posterior joint distribution of i and j.
In Scenario 4, the likelihood depends on both i and j, so the update changes our beliefs about both parameters.
And in Scenario 4 a non-black non-raven (chosen from non-black things) is evidence in favor of A. This might still be surprising, but let me suggest a way to think about it: in this scenario we are checking non-black things to make sure they are not ravens. If we find a non-black raven, that contradicts A. If we don’t, that supports A.
In all four scenarios, the results are consistent with intuition. So as long as you are clear about which scenario you are in, there is no paradox. The paradox is only apparent if you think you are in Scenario 2 and you imagine the result from Scenario 4.
In the context of the original problem:
If you walk out of your house and the first thing you see is a red apple (or a blue car, or a green leaf), that has no bearing on whether ravens are black.
But if you deliberately select a non-black thing and check whether it’s a raven, and you find that it is not, that actually is evidence that all ravens are black – but consistent with the standard Bayesian response, it is so weak it is negligible.
Successive updates
In these examples, we started with a uniform prior over all combinations of i and j. Of course that’s not a realistic representation of what we believe about the world. So let’s consider the effect of other priors.
In general, different priors lead to different posterior distributions, and in this case they lead to different conclusions about the strength of the evidence. But they lead to the same conclusion about the direction of the evidence.
To demonstrate, let’s see what happens if we observe a series of black ravens (in Scenario 1 or 3). For simplicity, assume that we sample with replacement.
The following function computes multiple updates, starting with the uniform prior and then using the posterior from each update as the prior for the next.
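Here is a sketch of that function, applied to the Scenario 1 likelihood (sampling with replacement, so the same likelihood is used each time):

def successive_updates(joint, likelihood, iters=10):
    # repeatedly apply the same likelihood, recording P(A) and the log odds ratio
    rows = []
    for k in range(iters):
        p0 = prob_A(joint)
        joint = update(joint, likelihood)
        p1 = prob_A(joint)
        lor = np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0))
        rows.append((k, p0, p1, lor))
    return rows

for row in successive_updates(prior, like1):
    print(row)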
This table shows the results in Scenario 1 (which is the same as in Scenario 3). For each iteration, the table shows the prior and posterior probability of A, and the log odds ratio.
Iteration   Prior      Posterior   LOR
0           0.100000   0.200000    0.810930
1           0.200000   0.284211    0.462624
2           0.284211   0.360000    0.348307
3           0.360000   0.427901    0.284942
4           0.427901   0.488715    0.245274
5           0.488715   0.543171    0.218261
6           0.543171   0.591920    0.198796
7           0.591920   0.635551    0.184196
8           0.635551   0.674590    0.172914
9           0.674590   0.709512    0.163995
As we see more ravens, the posterior probability of A increases, but the LOR decreases – which means that each raven provides weaker evidence than the previous one. In the long run the LOR converges to a value greater than 0 (about 0.11), which means that each raven provides at least some additional evidence, even when the prior is far from the uniform distribution we started with.
In the worst case, if the prior probability of A is 0 or 1, nothing we observe can change those beliefs, so nothing is evidence for or against A. But there is no prior where a black raven provides evidence against A.
[Proof sketch: Under A, all ravens are black (i = N), so the likelihood of the observation, i / (N + M), takes its maximum value, N / (N + M). Under any alternative, i < N and the likelihood is smaller. Therefore, for any prior that gives non-zero probability to both A and its complement, the Bayes factor is greater than 1 and the LOR is positive: these observations can never be evidence against A.]
The following table shows the results in Scenario 4, where we select a non-black thing and check that it is not a raven.
Iteration   Prior      Posterior   LOR
0           0.100000   0.149403    0.457933
1           0.149403   0.201006    0.359272
2           0.201006   0.253991    0.302582
3           0.253991   0.307217    0.264273
4           0.307217   0.359496    0.235611
5           0.359496   0.409837    0.212911
6           0.409837   0.457528    0.194344
7           0.457528   0.502141    0.178860
8           0.502141   0.543477    0.165785
9           0.543477   0.581514    0.154644
The pattern is similar. Each non-black thing that turns out not to be a raven is weaker evidence than the previous one. But it is always in favor of A – in this scenario, there is no prior where a non-black non-raven is evidence against A.
Varying M
Finally, let’s see how the strength of the evidence varies as we increase M, the number of non-ravens. The following function computes results in Scenario 4 for a range of values of M, holding constant the number of ravens, N = 9.
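Here is a sketch of that computation, packaging the Scenario 4 update as a function of N and M:

def scenario4(N, M):
    # posterior probability of A and log odds ratio after one Scenario 4 observation
    prior = np.ones((N + 1, M + 1))
    prior /= prior.sum()
    ii, jj = np.meshgrid(np.arange(N + 1), np.arange(M + 1), indexing='ij')
    nonblack = (N - ii) + (M - jj)
    like = np.where(nonblack > 0, (M - jj) / np.maximum(nonblack, 1), 0)
    post = prior * like
    post /= post.sum()
    p0, p1 = 1 / (N + 1), post[N, :].sum()
    lor = np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0))
    return p1, lor

for M in [20, 50, 100, 200, 500, 1000]:
    print(M, scenario4(9, M))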
M      Prior   Posterior   LOR
20     0.1     0.147655    0.444110
50     0.1     0.124515    0.246875
100    0.1     0.114530    0.151946
200    0.1     0.108495    0.091022
500    0.1     0.104100    0.044751
1000   0.1     0.102331    0.025640
As M increases (more non-ravens in the universe), the strength of the evidence decreases. This is consistent with the standard Bayesian response, which notes that in a realistic scenario, the evidence is negligible.
Conclusion
The standard Bayesian response to the Raven paradox is correct in the sense that, when a non-black non-raven is evidence that all ravens are black, the evidence is extremely weak. But that doesn’t explain why the roulette example – where the number of non-green non-zero pockets is relatively small – is still so contrary to intuition.
I think a better explanation for the paradox is the ambiguity of the word “observe”. If we are explicit about the sampling process that generates the observation, we find that a non-black non-raven may or may not be evidence that all ravens are black.
Scenario 2: If we choose a random thing and find that it is a non-black non-raven, that is not evidence.
Scenario 4: If we choose a non-black thing and find that it is a non-raven, that is evidence.
The first case is entirely consistent with intuition. The second case is less obvious, but if we consider smaller examples like a roulette wheel, and do the math, it can be reconciled with intuition.
Confusion between these scenarios causes the apparent paradox, and clarity about the scenarios resolves it.
Symmetry and Asymmetry
It might still seem strange that a black raven is always evidence for A and B, but a non-black non-raven may or may not be, depending on the sampling process. If A and B are logically identical, and a black raven supports A, it’s still not clear why a non-black non-raven doesn’t always support B.
After all, if we start with B, we conclude that a non-black non-raven is always evidence for B (and A), and a black raven may or may not be. Where does this asymmetry come from?
We broke the symmetry when we formulated “All ravens are black” as “Out of all ravens, how many are black?” This formulation first divides the world into ravens and non-ravens, then asks how many in each group are black.
Conversely, if we start with “All non-black things are non-ravens”, we formulate it as “Out of all non-black things, how many are ravens?” In this formulation, we divide the world into black and non-black things, then ask how many in each group are ravens.
The asymmetry is apparent when we parameterize the models. If we start with A, we define i to be the number of ravens that are black. And we find that in Scenario 1, the likelihood of a black raven depends on i, and in Scenario 2, the likelihood of a non-black non-raven does not.
If we start with B, we define i to be the number of non-black things that are non-ravens. Then the likelihood of a non-black non-raven (chosen at random) depends on i, but the likelihood of a black raven does not.
So the symmetry is broken when we formulate the hypothesis in a way that is testable with data. In propositional logic, A and B are equivalent in the sense that evidence for one must be evidence for the other. In the Bayesian formulation, “How many ravens are black?” and “How many non-black things are non-ravens?” are not equivalent; evidence for one is not necessarily evidence for the other.
A critic might say that the Bayesian formulation is a non-resolution – that is, it doesn’t solve the original problem posed by Hempel; it only solves a related problem by making additional assumptions.
A Bayesian response is that the Raven Paradox is only problematic in the abstract world of propositional logic; as soon as we formulate the question in a way that connects it to the real world through observation, it disappears. So the Raven Paradox is similar to the principle of explosion – it demonstrates a brittleness in propositional logic that makes it unsuitable for reasoning about many real-world hypotheses.
Related Reading
I am not the first to notice that the interpretation of evidence depends on a model of the data-generating process. In the context of the Raven Problem, Richard Royall wrote:
We see that the observation of a red pencil can be evidence that all ravens are black. To make the proper interpretation, we must have an additional piece of information. Whether the observation is or is not evidence supporting the hypothesis (A) that all ravens are black versus the hypothesis (B) that only a fraction … are black is determined by the sampling procedure. A randomly selected pencil that proves to be red is not evidence that all ravens are black, but a randomly selected red object that proves to be a pencil is.
Royall in his commentary on the Raven Paradox … observes that how one got the white shoes is inferentially important. If you grabbed a non-raven object at random, then it does not bear on the question of whether all ravens are black. If on the other hand you grabbed a random non-black object, and it turned out to be a pair of shoes, then it provides a very tiny amount of evidence for the hypothesis that all ravens are black …
Royall is right that the sampling process determines whether a red pencil (or white shoe) is evidence about ravens, and he analyzes a version of what I’m calling Scenario 4. But I don’t think his analysis quite explains why the paradox feels so counterintuitive, and it seems to have had little impact on the discussion of the Raven paradox in the confirmation theory literature.