I have not posted a new article for a few months, and I am now happy to announce the reason: I am working on a book! The working title is, of all things, Probably Overthinking It. If things go according to plan, it will be published by the University of Chicago Press in early 2023.
When I started, I thought it would be a collection of articles from this blog, but I ended up writing more new material than I expected. It’s going well, though: out of 12 chapters, I have a solid draft of 10 and a pretty good idea what will go in the other two.
Over the next few months, I will publish excerpts here, but for now I am working at maximum speed to get a complete first draft to the technical reviewers.
So, sorry for going quiet, but it will be worth the wait!
If you would like to get infrequent email announcements about my book, please sign up below. I’ll let you know about milestones, promotions, and other news, but not more than one email per month. I will not share your email or use this list for any other purpose.
Data Structures and Information Retrieval in Python
I am happy to announce the first release of a new book, Data Structures and Information Retrieval in Python, which is an introduction to data structures organized around a motivating example: building a search engine.
The elements of the search engine are the Crawler, which downloads web pages and follows links to other pages; the Indexer, which builds a map from each word to the pages where it appears; and the Retriever, which looks up search terms and finds pages that maximize relevance and quality.
The index is stored in Redis, which is a NoSQL database that provides data structures like sets, lists, and hashmaps. The book presents each data structure first in Python, then in Redis, which should help readers see which features are essential and which are implementation details.
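To give a flavor of the approach, here’s a hypothetical sketch (not code from the book) of the core of an indexer, first as a Python dictionary of sets, then as the equivalent commands in Redis via the redis-py client; the key name Index:data is made up for the example.

from collections import defaultdict

# In Python: map each word to the set of URLs where it appears
index = defaultdict(set)

def add_page(index, url, words):
    """Add the words that appear on a page to the index."""
    for word in words:
        index[word].add(url)

add_page(index, 'https://example.com', ['data', 'structures'])
print(index['data'])   # {'https://example.com'}

# The same structure in Redis, using one set per word
# (requires a running Redis server):
#
# import redis
# r = redis.Redis()
# r.sadd('Index:data', 'https://example.com')
# r.smembers('Index:data')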
The book is loosely based on Think Data Structures, which uses Java. When I started, I thought it would be a translation, but the changes were so pervasive, I ended up writing it from scratch.
As I did with Think Bayes, I wrote this book entirely in Jupyter notebooks, and used JupyterBook to translate them to HTML. The notebooks run on Colab, which is a service provided by Google that runs notebooks in a browser. So you can read the book, run the code, and work on exercises without installing anything.
Why?
I have mixed feelings about data structures. In the computer science curriculum, it is often used as a weed-out class, and in the technical interview process, it is used as a gatekeeper. In both cases, it plays a disproportionate role in determining who studies computer science and who gets jobs in the field.
And it tends to be a bad class. The material is presented theoretically, without context or application, organized in a way that seems designed to maximize tedium, and often taught badly. The goals of the course are ambiguous — is it about theoretical foundations or programming proficiency? — and the result is often a hybrid that does nothing well.
Also, it has not kept up with changes in the software landscape. When people were learning to program in C, it might have been appropriate to focus on the implementation of data structures. But for people learning Python, it’s probably better to learn how to use them first, and see how they are implemented when (and if) necessary.
Despite my misgivings about the material, in 2016 I found myself designing a data structures curriculum for the Flatiron School, on behalf of a large software company. In 2017 I updated the curriculum and published Think Data Structures. But the software curriculum at Olin uses more Python than Java, so I have not used the Java version in a class.
Since I was scheduled to teach Data Structures this fall, I decided to write a Python version. I started last August with an outline and about 10 notebooks in various stages of completion. I developed the rest of the material during the semester, staying about a week ahead of the students.
Now there are 29 notebooks: 22 that present the material and 7 quizzes with practice questions. I hope people will find it useful!
In the first edition of Think Bayes, I presented what I called the Geiger counter problem, which is based on an example in Jaynes, Probability Theory. But I was not satisfied with my solution or the way I explained it, so I cut it from the second edition.
I am re-reading Jaynes now, following the excellent series of videos by Aubrey Clayton, and this problem came back to haunt me. On my second attempt, I have a solution that is much clearer, and I think I can explain it better.
We have a radioactive source … which is emitting particles of some sort … There is a rate p, in particles per second, at which a radioactive nucleus sends particles through our counter; and each particle passing through produces counts at the rate θ. From measuring the number {c1 , c2 , …} of counts in different seconds, what can we say about the numbers {n1 , n2 , …} actually passing through the counter in each second, and what can we say about the strength of the source?
As a model of the source, Jaynes suggests we imagine “N nuclei, each of which has independently the probability r of sending a particle through our counter in any one second”. If N is large and r is small, the number of particles emitted in a given second is well modeled by a Poisson distribution with parameter s=Nr, where s is the strength of the source.
As a model of the sensor, we’ll assume that “each particle passing through the counter has independently the probability ϕ of making a count”. So if we know the actual number of particles, n, and the efficiency of the sensor, ϕ, the distribution of the count is Binomial(n,ϕ).
With that, we are ready to solve the problem. Following Jaynes, I’ll start with a uniform prior for s, over a range of values wide enough to cover the region where the likelihood of the data is non-negligible. To represent distributions, I’ll use the Pmf class from empiricaldist.
import numpy as np
from empiricaldist import Pmf

# uniform prior for the source strength s
ss = np.linspace(0, 350, 101)
prior_s = Pmf(1, ss)
For each value of s, the distribution of n is Poisson, so we can form the joint prior of s and n using the poisson function from SciPy. The following function creates a Pandas DataFrame that represents the joint prior.
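Here’s a minimal sketch of what that function might look like; the names make_joint and ns (the array of possible values of n) are placeholders, not necessarily the names in the original notebook.

import pandas as pd
from scipy.stats import poisson

# possible values of n (an assumed range, wide enough to cover
# the values with non-negligible probability)
ns = np.arange(0, 350)

def make_joint(prior_s, ns):
    """Joint prior of s and n: one row per value of n,
    one column per value of s."""
    S, N = np.meshgrid(prior_s.qs, ns)
    # for each value of s, the distribution of n is Poisson(s),
    # weighted by the prior probability of s
    probs = poisson.pmf(N, S) * prior_s.ps
    return pd.DataFrame(probs, index=ns, columns=prior_s.qs)

joint = make_joint(prior_s, ns)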
The result is a DataFrame with one row for each value of n and one column for each value of s.
To update the prior, we need to compute the likelihood of the data for each pair of parameters. However, in this problem the likelihood of a given count depends only on n, regardless of s, so we only have to compute it once for each value of n. Then we multiply each column in the prior by this array of likelihoods. The following function encapsulates this computation, normalizes the result, and returns the posterior distribution.
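Here’s a minimal sketch of that function. The counter efficiency phi and the first count c1 are assumptions, chosen to be consistent with the results quoted below.

from scipy.stats import binom

phi = 0.1   # assumed counter efficiency
c1 = 10     # assumed first count

def update(joint, phi, c):
    """Update the joint prior with an observed count c and normalize."""
    ns = joint.index
    # the likelihood of c depends only on n, not on s
    likelihood = binom.pmf(c, ns, phi)
    posterior = joint.multiply(likelihood, axis=0)
    return posterior / posterior.to_numpy().sum()

posterior = update(joint, phi, c1)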
The following figures show the posterior marginal distributions of s and n.
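They are computed with a helper function, marginal, which sums the joint posterior along one axis; here’s a minimal sketch:

def marginal(joint, axis):
    """Marginal distribution from a joint distribution:
    axis=0 sums over n and returns the marginal of s;
    axis=1 sums over s and returns the marginal of n."""
    return Pmf(joint.sum(axis=axis))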
posterior_s = marginal(posterior, 0)
posterior_n = marginal(posterior, 1)
The posterior mean of n is close to 109, which is consistent with Equation 6.116. The MAP is 99, one less than the analytic result of 100 in Equation 6.113. It looks like the posterior probabilities for 99 and 100 are the same, but the floating-point results differ slightly.
Jeffreys prior
Instead of a uniform prior for s, we can use a Jeffreys prior, in which the prior probability for each value of s is proportional to 1/s. This has the advantage of “invariance under certain changes of parameters”, which is “the only correct way to express complete ignorance of a scale parameter.” However, Jaynes suggests that it is not clear “whether s can properly be regarded as a scale parameter in this problem.” Nevertheless, he suggests we try it and see what happens. Here’s the Jeffreys prior for s.
# Jeffreys prior: probability proportional to 1/s, skipping s=0
prior_jeff = Pmf(1/ss[1:], ss[1:])
We can use it to compute the joint prior of s and n, and update it with c1.
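With the hypothetical helpers sketched above, that might look like this:

joint_jeff = make_joint(prior_jeff, ns)
posterior_jeff = update(joint_jeff, phi, c1)
posterior_s_jeff = marginal(posterior_jeff, 0)
posterior_n_jeff = marginal(posterior_jeff, 1)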
The posterior mean is close to 100 and the MAP is 91; both are consistent with the results in Equation 6.122.
Robot A
Now we get to what I think is the most interesting part of this example, which is to take into account a second observation under two models of the scenario:
Two robots, [A and B], have different prior information about the source of the particles. The source is hidden in another room which A and B are not allowed to enter. A has no knowledge at all about the source of particles; for all [it] knows, … the other room might be full of little [people] who run back and forth, holding first one radioactive source, then another, up to the exit window. B has one additional qualitative fact: [it] knows that the source is a radioactive sample of long lifetime, in a fixed position.
In other words, B has reason to believe that the source strength s is constant from one interval to the next, while A admits the possibility that s is different for each interval. The following figure, from Jaynes, represents these models graphically.
For A, the “different intervals are logically independent”, so the update with c2 = 16 starts with the same prior.
c2 = 16
# Robot A starts the second update from the same joint prior
posterior2 = update(joint, phi, c2)
Here’s the posterior marginal distribution of n2.
The posterior mean is close to 169, which is consistent with the result in Equation 6.124. The MAP is 160, which is consistent with Equation 6.123.
Robot B
For B, the “logical situation” is different. If we consider s to be constant, we can – and should! – take the information from the first update into account when we perform the second update. We can do that by using the posterior distribution of s from the first update to form the joint prior for the second update, like this:
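Here’s a sketch, reusing the hypothetical helpers from above (posterior_s is the marginal distribution of s from the first update):

# Robot B: the posterior of s from the first update
# becomes the prior of s for the second update
joint_B = make_joint(posterior_s, ns)
posterior_B = update(joint_B, phi, c2)
posterior_n2_B = marginal(posterior_B, 1)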
The posterior mean of n2 is close to 137.5, which is consistent with Equation 6.134. The MAP is 132, one less than the analytic result, 133. But again, there are two values with the same probability except for floating-point errors.
Under B’s model, the data from the first interval updates our belief about s, which influences what we believe about n2.
Going the other way
That might not seem surprising, but there is an additional point Jaynes makes with this example, which is that it also works the other way around: Having seen c2, we have more information about s, which means we can – and should! – go back and reconsider what we concluded about n1.
We can do that by imagining we did the experiments in the opposite order, as in the sketch below:
We’ll start again with a joint prior based on a uniform distribution for s,
Update it based on c2,
Use the posterior distribution of s to form a new joint prior, and
Update it based on c1 to get a revised distribution of n1.
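Here’s what those steps might look like with the hypothetical helpers sketched above:

joint_rev = make_joint(prior_s, ns)             # uniform prior for s
posterior_rev = update(joint_rev, phi, c2)      # update with c2
posterior_s2 = marginal(posterior_rev, 0)       # posterior of s given c2

joint_rev2 = make_joint(posterior_s2, ns)       # new joint prior
posterior_rev2 = update(joint_rev2, phi, c1)    # update with c1
posterior_n1_rev = marginal(posterior_rev2, 1)  # revised distribution of n1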
The posterior mean is close to 131.5, which is consistent with Equation 6.133. And the MAP is 126, which is one less than the result in Equation 6.132, again due to floating-point error.
Here’s what the new distribution of n1 looks like compared to the original, which was based on c1 only.
With the additional information from c2:
We give higher probability to large values of s, so we also give higher probability to large values of n1, and
The width of the distribution is narrower, which shows that with more information about s, we have more information about n1.
Discussion
This is one of several examples Jaynes uses to distinguish between “logical and causal dependence.” In this example, causal dependence only goes in the forward direction: “s is the physical cause which partially determines n; and then n in turn is the physical cause which partially determines c”.
Therefore, c1 and c2 are causally independent: if the number of particles counted in one interval is unusually high (or low), that does not cause the number of particles during any other interval to be higher or lower.
But if s is unknown, they are not logically independent. For example, if c1 is lower than expected, that implies that lower values of s are more likely, which implies that lower values of n2 are more likely, which implies that lower values of c2 are more likely.
And, as we’ve seen, it works the other way, too. For example, if c2 is higher than expected, that implies that higher values of s, n1, and c1 are more likely.
If you find the second result more surprising – that is, if you think it’s weird that c2 changes what we believe about n1 – that implies that you are not (yet) distinguishing between logical and causal dependence.
(a) Math-proficient women disproportionately prefer careers in non–math-intensive fields and are more likely to leave math-intensive careers as they advance;
(b) more men than women score in the extreme math-proficient range on gatekeeper tests, such as the SAT Mathematics and the Graduate Record Examinations Quantitative Reasoning sections;
(c) women with high math competence are disproportionately more likely to have high verbal competence, allowing greater choice of professions; and
(d) in some math-intensive fields, women with children are penalized in promotion rates.
To people familiar with this area of research, none of these are surprising, but the third caught my attention because I recently looked at the correlation between math and verbal scores on the SAT and ACT. In general, they are highly correlated, with r around 0.7, and they are equally correlated for men and women. So I was curious to know where this claim comes from and, if it is true, how big a factor it might be.
As evidence, Ceci et al. summarize results from “a tracking study of 1,100 high-mathematics aptitude students who expressed a goal of majoring in mathematics or science in college”, which found:
One determinant of who switched out of math/science fields was the asymmetry between their verbal and mathematics abilities. Women’s verbal abilities on average were nearly as strong as their mathematics abilities (only 61 points difference between their SAT-V and SAT-M), leading them to enter professions that prized verbal reasoning (e.g., law), whereas men’s verbal abilities were an average of 115 points lower than their mathematics ability, possibly leading them to view mathematics as their only strength.
I don’t have access to that dataset, but I ran a similar analysis with data from the National Longitudinal Survey of Youth 1997 (NLSY97), which “follows the lives of a sample of [8,984] American youth born between 1980-84”. The public data set includes the participants’ scores on several standardized tests, including the SAT and ACT. Assuming that most participants took these exams when they were 17, they probably took them between 1997 and 2001.
I found that the pattern described by Ceci et al. also appears in this dataset. Although the correlation between math and verbal scores is the same for men and women, the slope of the regression line is not. In a group of male and female test-takers with the same math score, the verbal scores for the female test-takers are higher, on average. Near the high end of the range, the difference is about 35 points, which is a little smaller than the difference in the previous study, 54 points.
So we might ask:
Is this a big enough difference that it seems likely to affect career choices? For example, suppose Student A has scores M 750 V 660 and Student B has scores M 750 V 690. Would A be substantially more likely than B to “view mathematics as their only strength”?
If we assume that the answer is yes, and that both students make career choices accordingly, how big an effect would this have on the sex ratios we see in math-intensive fields?
I don’t have the data to answer the first question, but we can use the data we have, and a model of the filtering processes, to put an upper bound on the second.
To summarize the results, the largest effect I found for factor (c) is that it might increase the sex ratio in a math-intensive field by 7-14%. For example, if the requirement for a math-intensive job is 700 or more on the SAT math section, the sex ratio among the people who meet this requirement is 1.8. Now suppose that everyone who meets this standard takes a math-intensive job, EXCEPT the people who also get 700 or more on the verbal section. If all of those people choose a different career, the sex ratio of the ones left in the math-intensive job goes up to 2.0. In this part of the range, the effect of factor (c) is non-negligible.
To see what happens as we move farther into the tail, I used the NLSY data to create a Gaussian model, and used the model to simulate test scores beyond the range of the SAT. With this model, we see that the effect of factor (b) increases as we make the requirements stricter.
For example, if the threshold score is 800 for the math and verbal sections, the sex ratio among the people who meet the math requirement is 4.6. If the people who meet the verbal requirement choose different careers, the sex ratio among the people left behind is 4.9 (an increase of about 7%). So it seems like the effect of factor (c) gets smaller as we go farther into the tails of the distributions.
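To illustrate the shape of this computation, here’s a sketch of a Gaussian model like the one described above. The means, standard deviations, and correlation below are made-up placeholders, not the parameters estimated from the NLSY data, so the ratios it prints will not match the numbers quoted here.

import numpy as np

rng = np.random.default_rng(17)

def simulate_scores(mean_math, mean_verbal, std_math, std_verbal, rho=0.7, n=1_000_000):
    """Draw (math, verbal) score pairs from a bivariate Gaussian model."""
    cov = [[std_math**2, rho * std_math * std_verbal],
           [rho * std_math * std_verbal, std_verbal**2]]
    return rng.multivariate_normal([mean_math, mean_verbal], cov, size=n)

# placeholder parameters, not estimates from the data
male = simulate_scores(530, 500, 120, 115)
female = simulate_scores(500, 505, 110, 112)

def sex_ratio(thresh=700, exclude_high_verbal=False):
    """Ratio of men to women who meet the math requirement, optionally
    excluding those who also meet the verbal requirement (factor c)."""
    def count(scores):
        keep = scores[:, 0] >= thresh
        if exclude_high_verbal:
            keep &= scores[:, 1] < thresh
        return keep.sum()
    return count(male) / count(female)

print(sex_ratio())                          # factor (b) alone
print(sex_ratio(exclude_high_verbal=True))  # factors (b) and (c) together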
Finally, I used the model to decompose factor (b) into two parts, the difference in means and the difference in variance. When the threshold is 800, the contributions of the two parts are about equal; that is:
If we set the means to be the same, but preserve the difference in variance, the sex ratio among people who meet the math requirement is about 2.2.
If we set the variances to be the same, but preserve the difference in means, the sex ratio among people who meet the math requirement is about 2.2.
In summary:
In a simple model, the effect of factor (c) is modest; in reality, it is likely to be smaller.
In the same model, the effect of factor (b) is substantial, and can be decomposed into roughly equal contributions from differences in means and variances.
The inspection paradox (aka length-biased sampling) is one of my favorite topics, and it turns out to be useful in the fight against COVID-19.
During the pandemic, you have probably heard about the effective reproduction number, R, which is the average number of people infected by each infected person. R is important because it determines the large-scale course of the epidemic. As long as R is greater than 1, the number of cases will grow exponentially; if we find ways to drive R below 1, the number of cases will dwindle toward zero.
However, R is an average, and the average is not the whole story. With COVID-19, like many other epidemics, there is a lot of variation around the average.
According to a news feature in Nature, “One study in Hong Kong found that 19% of cases of COVID-19 were responsible for 80% of transmission, and 69% of cases didn’t transmit the virus to anyone.” In other words, most infections are caused by a small number of superspreaders.
This observation suggests a strategy for contact tracing. When an infected patient is discovered, it is common practice to identify people they have been in contact with who might also be infected. “Forward tracing” is intended to find people the patient might have infected; “backward tracing” is intended to find the person who infected the patient.
Now suppose you are a public health officer trying to slow or stop the spread of a communicable disease. Assuming that you have limited resources to trace contacts and test for the disease — and that’s a pretty good assumption — which do you think would be more effective, forward or backward tracing?
The inspection paradox suggests that backward tracing is more likely to discover a superspreader and the people they have infected.
According to the Nature article, “[Backward tracing] is extremely effective for the coronavirus because of its propensity to be passed on in superspreading events […] Any new case is more likely to have emerged from a cluster of infections than from one individual, so there’s value in going backwards to find out who else was linked to that cluster.”
To quantify this effect, let’s suppose that 70% of infected people don’t infect anyone else, as in the Hong Kong study, and the other 30% infect between 1 and 15 other people, uniformly distributed. The average of this distribution is 2.4, which is a plausible value of R.
Now suppose we discover an infected patient, trace forward, and find someone the patient infected. On average, we expect this person to infect 2.4 other people.
But if we trace backward and find the person who infected the patient, we are more likely to find someone who has infected a lot of people, and less likely to find someone who has only infected a few. In fact, the probability that we find any particular spreader is proportional to the number of people they have infected.
By simulating this sampling process, we can compute the distribution we would see by backward tracing. The average of this biased distribution is 10.1, more than four times the average of the unbiased distribution. This result suggests that backward tracing can discover four times more cases than forward tracing, given the same resources.
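Here’s a sketch of that computation, under the assumptions stated above. The exact averages depend on the details of the simulation, so they come out close to, but not exactly, the values quoted.

import numpy as np

# 70% of cases infect no one; 30% infect 1-15 others, uniformly
values = np.arange(16)
pmf = np.append([0.7], np.full(15, 0.3 / 15))

print(np.sum(values * pmf))    # unbiased mean, about 2.4

# backward tracing finds a spreader with probability
# proportional to the number of people they infected
biased = values * pmf
biased /= biased.sum()

rng = np.random.default_rng(17)
sample = rng.choice(values, p=biased, size=100_000)
print(sample.mean())           # biased mean, roughly 10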
This example is not just theoretical; Japan adopted this strategy in February 2020. As Michael Lewis describes in The Premonition:
“When the Japanese health authorities found a new case, they did not waste their energy asking the infected person for a list of contacts over the previous few days, to determine whom the person might have infected in turn. […] Instead, they asked for a list of people with whom the infected person had interacted further back in time. Find the person who had infected the newly infected person and you might have found a superspreader. Find a superspreader and you could track down the next superspreader before [they] really got going.”
So the inspection paradox is not always a nuisance; sometimes we can use it to our advantage.
This article is an excerpt from a new book I am working on, Probably Overthinking It: The puzzles and paradoxes of probability.
I’ve been enjoying Aubrey Clayton’s new book Bernoulli’s Fallacy. The first chapter, which is about the historical development of competing definitions of probability, is worth the price of admission alone.
One of the examples in Chapter 1 is a simplified version of a problem posed by Thomas Bayes. The original version, which I wrote about here, involves a billiards (pool) table; Clayton’s version uses dice:
Your friend rolls a six-sided die and secretly records the outcome; this number becomes the target T. You then put on a blindfold and roll the same six-sided die over and over. You’re unable to see how it lands, so each time your friend […] tells you only whether the number you just rolled was greater than, equal to, or less than T.
Suppose in one round of the game we had this sequence of outcomes, with G representing a greater roll, L a lesser roll, and E an equal roll:
G, G, L, E, L, L, L, E, G, L
Clayton, Bernoulli’s Fallacy, pg 36.
Based on this data, what is the posterior distribution of T?
Computing likelihoods
There are two parts to my solution: computing the likelihood of the data under each hypothesis, and then using those likelihoods to compute the posterior distribution of T.
To compute the likelihoods, I’ll demonstrate one of my favorite idioms, using a meshgrid to apply an operation, like >, to all pairs of values from two sequences.
In this case, the sequences are
hypos: the hypothetical values of T, and
outcomes: the possible outcomes each time we roll the die
hypos = [1,2,3,4,5,6]
outcomes = [1,2,3,4,5,6]
If we compute a meshgrid of outcomes and hypos, the result is two arrays.
import numpy as np
O, H = np.meshgrid(outcomes, hypos)
The first contains the possible outcomes repeated down the columns; the second contains the hypothetical values of T repeated across the rows. Now we can compare the two arrays elementwise and take the mean along each row. The result is the probability that the outcome is greater than T, for each hypothetical value of T. I’ll name this array gt:
gt = (O > H).mean(axis=1)
The first element of the array is 5/6, which indicates that if T is 1, the probability of exceeding it is 5/6. The second element is 2/3, which indicates that if T is 2, the probability of exceeding it is 2/3. And so on.
Now we can compute the corresponding arrays for less than and equal.
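A minimal sketch, following the same idiom:

lt = (O < H).mean(axis=1)
eq = (O == H).mean(axis=1)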
In the next section, we’ll use these arrays to do a Bayesian update.
The Update
In this example, computing the likelihoods was the hard part. The Bayesian update is easy. Since T was chosen by rolling a fair die, the prior distribution for T is uniform. I’ll use a Pandas Series to represent it.
import pandas as pd
pmf = pd.Series(1/6, hypos)
pmf
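Here’s a minimal sketch of the update, which multiplies the prior by the likelihood of each outcome in the observed sequence and then normalizes:

# likelihood array for each kind of report
likelihoods = {'G': gt, 'L': lt, 'E': eq}

# the observed sequence from above
data = ['G', 'G', 'L', 'E', 'L', 'L', 'L', 'E', 'G', 'L']

posterior = pmf.copy()
for outcome in data:
    posterior *= likelihoods[outcome]

posterior /= posterior.sum()
posterior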
As an aside, you might have noticed that the values in eq are all the same. So when the value we roll is equal to T, we don’t get any new information about T. We could leave the instances of eq out of the data, and we would get the same answer.
Suppose you meet someone who looks like the brother of your friend Mary. You ask if he has a sister named Mary, and he says “Yes I do, but I don’t think I know you.”
You remember that Mary has a sister who is left-handed, but you don’t remember her name. So you ask your new friend if he has another sister who is left-handed.
If he does, how much evidence does that provide that he is the brother of your friend, rather than a random person who coincidentally has a sister named Mary and another sister who is left-handed? In other words, what is the Bayes factor of the left-handed sister?
Let’s assume:
Out of 100 families with children, 20 have one child, 30 have two children, 40 have three children, and 10 have four children.
All children are either boys or girls with equal probability, one girl in 10 is left-handed, and one girl in 100 is named Mary.
Name, sex, and handedness are independent, so every child has the same probability of being a girl, left-handed, or named Mary.
If the person you met had more than one sister named Mary, he would have said so, but he could have more than one sister who is left-handed.
I am not the first person to observe that it sometimes takes several tries to plug in a USB connector (specifically the rectangular Type A connector, which is not reversible). There are memes about it, there are cartoons about it, and on sites like Quora, people have asked about it more than a few times.
But I might be the first to use Bayesian decision analysis to figure out the optimal strategy for plugging in a USB connector. Specifically, I have worked out how long you should try on the first side before flipping, how long you should try on the second side before flipping again, how long you should try on the third side, and so on.
Is Simpson’s Paradox just a mathematical curiosity, or does it happen in real life? And if it happens, what does it mean? To answer these questions, I’ve been searching for natural examples in data from the General Social Survey (GSS).
A few weeks ago I posted this article, where I group GSS respondents by their decade of birth and plot changes in their opinions over time. Among questions related to faith in humanity, I found several instances of Simpson’s paradox; for example, in every generation, people have become more optimistic over time, but the overall average is going down over time. The reason for this apparent contradiction is generational replacement: as old optimists die, they are being replaced by young pessimists.
In this follow-up article, I group people by level of education and plot their opinions over time, and again I found several instances of Simpson’s paradox. For example, at every level of education, support for legal abortion has gone down over time (at least under some conditions). But the overall level of support has increased, because over the same period, more people have achieved higher levels of education.
In the most recent article, I group people by decade of birth again, and plot their opinions as a function of age rather than time. I found some of the clearest instances of Simpson’s paradox so far. For example, if we plot support for interracial marriage as a function of age, the trend is downward; older people are less likely to approve. But within every birth cohort, support for interracial marriage increases as a function of age.
With so many examples, we are starting to see a pattern:
Examples of Simpson’s paradox are confusing at first because they violate our expectation that if a trend goes in the same direction in every group, it must go in the same direction when we put the groups together.
But now we realize that this expectation is naive: mathematically, it does not have to be true, and in practice, there are several reasons it can happen, including generational replacement and period effects.
Once explained, the examples we’ve seen so far have turned out not to be very informative. Rather than revealing useful information about the world, it seems like Simpson’s paradox is most often a sign that we are not looking at the data in the most effective way.
But before I give up, I want to give it one more try.
A more systematic search
Each example of Simpson’s paradox involves three variables:
On the x-axis, I’ve put time, age, and a few other continuous variables.
On the y-axis, I’ve put the fraction of people giving the most common response to questions about opinions, attitudes, and world view.
And I have grouped respondents by decade of birth, age, sex, race, religion, and several other demographic variables.
At this point I have tried a few thousand combinations and found about ten clear-cut instances of Simpson’s paradox. So I’ve decided to make a more systematic search. From the GSS data I selected 119 opinion questions that were asked repeatedly over more than a decade, and 12 demographic questions I could sensibly use to group respondents.
With 119 possible variables on the x-axis, the same 119 possibilities on the y-axis, and 12 groupings, there are 84,118 sensible combinations. When I tested them, 594 produced computational errors of some kind, in most cases because some variables have logical dependencies on others. Among the remaining combinations, I found 19 instances of Simpson’s paradox.
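Here’s a sketch of the kind of test I ran for each combination. The column names are placeholders, and the real search also has to recode responses and handle missing data, but the core idea is to compare the sign of the overall trend with the signs of the trends within each group.

import numpy as np
from scipy.stats import linregress

def is_simpson(df, x, y, group):
    """True if the overall trend of y versus x has the opposite sign
    from the trend within every group."""
    overall = linregress(df[x], df[y]).slope
    slopes = [linregress(g[x], g[y]).slope
              for _, g in df.groupby(group)]
    return all(np.sign(slope) == -np.sign(overall) for slope in slopes)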
So one conclusion we can reach immediately is that Simpson’s paradox is rare in the wild, at least with data of this kind. But let’s look more closely at the 19 examples.
Many of them turn out to be statistical noise. For example, the following figure shows responses to a question about premarital sex on the y-axis, responses to a question about confidence in the press on the x-axis, with respondents grouped by political alignment.
As confidence in the press declines from left to right, the overall fraction of people who think premarital sex is “not wrong at all” declines slightly. But within each political group, there is a slight increase.
Although this example meets the requirements for Simpson’s paradox, it is unlikely to mean much. Most of these relationships are not statistically significant, which means that if the GSS had randomly sampled a different group of people, it is plausible that these trends might have gone the other way.
And this should not be surprising. If there is no relationship between two variables in reality, the actual trend is zero and the trend we see in a random sample is equally likely to be positive or negative. Under this assumption, we can estimate the probability of seeing a Simpson paradox by chance:
If the overall trend is positive, the trend in all three groups has to be negative, which happens one time in eight.
If the overall trend is negative, the trend in all three groups has to be positive, which also happens one time in eight.
When there are more groups, Simpson’s paradox is less likely to happen by chance. Even so, since we tried so many combinations, it is only surprising that we did not find more.
A few good examples
Most of the examples I found are like the previous one. The relationships are so weak that the trends we see are mostly random, which means we don’t need a special explanation for Simpson’s paradox. But I found a few examples where the Simpsonian reversal is probably not random and, even better, it makes sense.
For example, the following figure shows the fraction of people who would support a gun law as a function of how often they pray, grouped by sex.
Within each group, the trend is downward: the more you pray, the less likely you are to favor gun control. But the overall trend goes the other way: people who pray more are more likely to support gun control. Before you proceed, see if you can figure out what’s going on.
At this point you might guess that there is a correlation of some kind between the variable on the x-axis and the groups. In this example, there is a substantial difference in how much men and women pray. The following figure shows how much:
And that’s why average support for gun control increases as a function of prayer:
The low-prayer groups are mostly male, so average support for gun control is closer to the male response, which is lower.
The high-prayer groups are mostly female, so the overall average is closer to the female response, which is higher.
On one hand, this result is satisfying because we were able to explain something surprising. But having made the effort, I’m not sure we have learned much. Let’s look at one more example.
The GSS includes the following question about a hypothetical open housing law:
Suppose there is a community-wide vote on the general housing issue. There are two possible laws to vote on. One law says that a homeowner can decide for himself whom to sell his house to, even if he prefers not to sell to [someone because of their race or color]. The second law says that a homeowner cannot refuse to sell to someone because of their race or color. Which law would you vote for?
The following figure shows the fraction of people who would vote for the second law, grouped by race and plotted as a function of income (on a log scale).
In every group, support for open housing increases as a function of income, but the overall trend goes the other way: people who make more money are less likely to support open housing.
At this point, you can probably figure out why:
White respondents are less likely to support this law than Black respondents and people of other races, and
People in the higher income groups are more likely to be white.
So the overall average in the lower income groups is closer to the non-white response; the overall average in the higher income groups is closer to the white response.
Summary
Is Simpson’s paradox a mathematical curiosity, or does it happen in real life?
Based on my exploration (and a similar search in a different dataset), if you go looking for Simpson’s paradox in real data, you will find it. But it is rare: I tried almost 100,000 combinations, and found only about 100 examples. And a large majority of the examples I found were just statistical noise.
What does Simpson’s paradox tell us about the data, and about the world?
In the examples I found, Simpson’s paradox doesn’t reveal anything about the world that is useful to know. Mostly it creates confusion, especially for people who have not encountered it before. Sometimes it is satisfying to figure out what’s going on, but if you create confusion and then resolve it, I am not sure you have made net progress. If Simpson’s paradox is useful, it is as a warning that the question you are asking and the way you are looking at the data don’t quite go together.
A recent post on Reddit asks, “Why was Bayes’ Theory not accepted/popular historically until the late 20th century?”
Great question! As always, there are many answers to a question like this, and the good people of Reddit provide several. But the first and most popular answer is, in my humble opinion, wrong.
The story goes something like this: “Bayesian methods are computationally expensive, so even though they were known in the early days of modern statistics, they were not practical until the availability of computational power and the recent development of efficient sampling algorithms.”
This theory is appealing because, if we look at problems where Bayesian methods are currently used, many of them are large and complex, and would indeed have been impractical to solve just a few years ago.
I think it is also appealing because it rationalizes the history of statistics. Ignoring Bayesian methods for almost 100 years wasn’t a mistake, we can tell ourselves; we were just waiting for the computers to catch up.
Well, I’m sorry, but that’s bunk. In fact, we could have been doing Bayesian statistics all along, using conjugate priors and grid algorithms.
Conjugate Priors
A large fraction of common, practical problems in statistics can be solved using conjugate priors, and the solutions require almost no computation. For example:
Problems that involve estimating proportions can be solved using a beta prior and a binomial likelihood function. In that case, a Bayesian update requires exactly two addition operations (see the sketch after this list).
In the multivariate case, with a Dirichlet prior and a multinomial likelihood function, the update consists of adding two vectors.
Problems that involve estimating rates can be solved with a gamma prior and an exponential or Poisson likelihood function — and the update requires two additions.
For problems that involve estimating the parameters of a normal distribution, things are a little more challenging: you have to compute the mean and standard deviation of the data, and then perform about a dozen arithmetic operations.
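For example, here’s what the beta-binomial update looks like, with made-up data:

# prior Beta(alpha, beta); data with k successes in n trials
alpha, beta = 1, 1      # uniform prior
k, n = 140, 250         # made-up data

# the entire update is two additions
alpha += k
beta += n - k

# posterior mean of the estimated proportion
print(alpha / (alpha + beta))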
For details, see Chapter 18 of Think Bayes. And for even more examples, see this list of conjugate priors. All of these could have been done with paper and pencil, or chalk and rock, at any point in the 20th century.
And these methods would be sufficient to solve many common problems in statistics, including everything covered in an introductory statistics class, and a lot more. In the time it takes for students to understand p-values and confidence intervals, you could teach them Bayesian methods that are more interesting, comprehensible, and useful.
In terms of computational efficiency, updates with conjugate priors border on miraculous. But they are limited to problems where the prior and likelihood can be well modeled by simple analytic functions. For other problems, we need other methods.
Grid Algorithms
The idea behind grid algorithms is to enumerate all possible values for the parameters we want to estimate and, for each set of parameters:
Compute the prior probability,
Compute the likelihood of the data,
Multiply the priors and the likelihoods,
Add up the products to get the total probability of the data, and
Divide through to normalize the posterior distribution.
If the parameters are continuous, we approximate the results by evaluating the prior and likelihood at a discrete set of values, often evenly spaced to form a d-dimensional grid, where d is the number of parameters.
If there are n possible values and m elements in the dataset, the total amount of computation we need is proportional to the product n × m, which is practical for most problems. And in many cases we can do even better by summarizing the data; then the computation we need is proportional to n + m.
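Here’s a minimal sketch of this recipe for a one-parameter problem, estimating a proportion with made-up data summarized as k successes in a fixed number of trials:

import numpy as np

qs = np.linspace(0, 1, 101)    # grid of possible values of the proportion
prior = np.ones(len(qs))       # uniform prior
k, trials = 140, 250           # made-up data, already summarized

likelihood = qs**k * (1 - qs)**(trials - k)   # binomial likelihood, up to a constant
unnorm = prior * likelihood                   # multiply priors and likelihoods
posterior = unnorm / unnorm.sum()             # divide through to normalize

print(np.sum(qs * posterior))                 # posterior mean

Because the data are summarized before the update, the work is proportional to the size of the grid, not the size of the dataset.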
For problems with 1-2 parameters — which includes many useful, real-world problems — grid algorithms are efficient enough to run on my 1982 vintage Commodore 64.
For problems with 3-4 parameters, we need a little more power. For example, in Chapter 15 of Think Bayes I solve a problem with 3 parameters, which takes a few seconds on my laptop, and in Chapter 17 I solve a problem that takes about a minute.
For more than six parameters, grid algorithms are not practical at all. But you can solve a lot of real-world problems with fewer than six parameters, using only the computational power that’s been available since 1970.
So why didn’t we?
Awful People, Bankrupt Ideas
In 1925, R.A. Fisher wrote, “… it will be sufficient … to reaffirm my personal conviction … that the theory of inverse probability is founded upon an error, and must be wholly rejected.” By “inverse probability”, he meant what is now called Bayesian statistics, and this is probably the nicest thing he ever wrote about it.
For details of the story, I recommend The Theory That Would Not Die, where Sharon Bertsch McGrayne writes: “If Bayes’ story were a TV melodrama, it would need a clear-cut villain, and Fisher would probably be the audience’s choice by acclamation.”
Among other failings, Fisher feuded endlessly with Karl Pearson, Egon Pearson, and Jerzy Neyman, to the detriment of statistics, science, and the world. But he and Neyman agreed about one thing: they were both rabid and influential anti-Bayesians.
The focus of their animosity was the apparent subjectivity of Bayesian statistics, particularly in the choice of prior distributions. But this concern is, in my personal conviction, founded upon an error: the belief that frequentist methods are less subjective than Bayesian methods.
All statistical methods are based on modeling decisions, and modeling decisions are subjective. With Bayesian methods, the modeling decisions are represented more explicitly, but that’s a feature, not a bug. As I.J. Good said, “The subjectivist [Bayesian] states his judgements, whereas the objectivist [frequentist] sweeps them under the carpet by calling assumptions knowledge, and he basks in the glorious objectivity of science.”
In summary, it would be nice to think it was reasonable to neglect Bayesian statistics for most of the 20th century because we didn’t have the computational power to make them practical. But that’s a rationalization. A much more substantial part of the reason is the open opposition of awful people with bankrupt ideas.