
Factors of Underrepresentation

I recently encountered a 2009 paper by Ceci, Williams, and Barnett, “Women’s underrepresentation in science: sociocultural and biological considerations”, which lists in the abstract these “factors unique to underrepresentation [of women] in math-intensive fields”:

(a) Math-proficient women disproportionately prefer careers in non–math-intensive fields and are more likely to leave math-intensive careers as they advance;

(b) more men than women score in the extreme math-proficient range on gatekeeper tests, such as the SAT Mathematics and the Graduate Record Examinations Quantitative Reasoning sections;

(c) women with high math competence are disproportionately more likely to have high verbal competence, allowing greater choice of professions; and

(d) in some math-intensive fields, women with children are penalized in promotion rates.

To people familiar with this area of research, none of these are surprising, but the third caught my attention because I recently looked at the correlation between math and verbal scores on the SAT and ACT. In general, they are highly correlated, with r around 0.7, and they are equally correlated for men and women. So I was curious to know where this claim comes from and, if it is true, how big a factor it might be.

As evidence, Ceci et al. summarize results from “a tracking study of 1,100 high-mathematics aptitude students who expressed a goal of majoring in mathematics or science in college”, which found:

One determinant of who switched out of math/science fields was the asymmetry between their verbal and mathematics abilities. Women’s verbal abilities on average were nearly as strong as their mathematics abilities (only 61 points difference between their SAT-V and SAT-M), leading them to enter professions that prized verbal reasoning (e.g., law), whereas men’s verbal abilities were an average of 115 points lower than their mathematics ability, possibly leading them to view mathematics as their only strength.

And they cite Achter, Lubinski, Benbow, & Eftekhari-Sanjani, 1999 and Wai, Lubinski, & Benbow, 2005.

I don’t have access to that dataset, but I ran a similar analysis with data from the National Longitudinal Survey of Youth 1997 (NLSY97), which “follows the lives of a sample of [8,984] American youth born between 1980-84”. The public data set includes the participants’ scores on several standardized tests, including the SAT and ACT. Assuming that most participants took these exams when they were 17, they probably took them between 1997 and 2001.

I found that the pattern described by Ceci et al. also appears in this dataset. Although the correlation between math and verbal scores is the same for men and women, the slope of the regression line is not. In a group of male and female test-takers with the same math score, the verbal scores for the female test-takers are higher, on average. Near the high end of the range, the difference is about 35 points, which is a little smaller than the difference in the previous study, 54 points.

So we might ask:

  1. Is this a big enough difference that it seems likely to affect career choices? For example, suppose Student A has scores M 750 V 660 and Student B has scores M 750 V 690. Would A be substantially more likely than B to “view mathematics as their only strength”?
  2. If we assume that the answer is yes, and that both students make career choices accordingly, how big an effect would this have on the sex ratios we see in math-intensive fields?

I don’t have the data to answer the first question, but we can use the data we have, and a model of the filtering processes, to put an upper bound on the second.

To summarize the results, the largest effect I found for factor (c) is that it might increase the sex ratio in a math-intensive field by 7-14%. For example, if the requirement for a math-intensive job is 700 or more on the SAT math section, the sex ratio among the people who meet this requirement is 1.8. Now suppose that everyone who meets this standard takes a math-intensive job, EXCEPT the people who also get 700 or more on the verbal section. If all of those people choose a different career, the sex ratio of the ones left in the math-intensive job goes up to 2.0. In this part of the range, the effect of factor (c) is non-negligible.

To see what happens as we move farther into the tail, I used the NLSY data to create a Gaussian model, and used the model to simulate test scores beyond the range of the SAT. With this model, we see that the effect of factor (b) increases as we make the requirements stricter.

For example, if the threshold score is 800 for the math and verbal sections, the sex ratio among the people who meet the math requirement is 4.6. If the people who meet the verbal requirement choose different careers, the sex ratio among the people left behind is 4.9 (an increase of about 7%). So it seems like the effect of factor (c) gets smaller as we go farther into the tails of the distributions.
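
To make the computation concrete, here is a minimal sketch of this kind of Gaussian model. The means, standard deviations, and correlation below are placeholders chosen for illustration, not the values estimated from the NLSY97 data, so the ratios it prints are not the ones reported above; it only shows the mechanics of the filtering process.

import numpy as np

rng = np.random.default_rng(17)

# Placeholder parameters for (math, verbal) scores on an SAT-like scale.
# These are NOT the values estimated from the NLSY97 data.
params = {
    "male":   dict(means=[530, 500], stds=[120, 115]),
    "female": dict(means=[510, 505], stds=[110, 112]),
}
rho = 0.7          # assumed correlation between math and verbal scores
n = 1_000_000      # simulated test-takers per group

def simulate(means, stds, n):
    """Draw correlated (math, verbal) scores from a bivariate normal."""
    cov = [[stds[0]**2, rho * stds[0] * stds[1]],
           [rho * stds[0] * stds[1], stds[1]**2]]
    return rng.multivariate_normal(means, cov, size=n)

scores = {sex: simulate(**p, n=n) for sex, p in params.items()}

for math_thresh, verbal_thresh in [(700, 700), (800, 800)]:
    # Factor (b): sex ratio among people who meet the math requirement.
    above = {sex: s[s[:, 0] >= math_thresh] for sex, s in scores.items()}
    ratio_b = len(above["male"]) / len(above["female"])

    # Factor (c): remove the people who also meet the verbal requirement.
    left = {sex: s[s[:, 1] < verbal_thresh] for sex, s in above.items()}
    ratio_c = len(left["male"]) / len(left["female"])

    print(math_thresh, round(ratio_b, 2), round(ratio_c, 2))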

Finally, I use the model to decompose factor (b) into two parts, the difference in means and the difference in variances (a sketch of this computation follows the list below). When the threshold is 800, the contributions of the two parts are about equal; that is:

  • If we set the means to be the same, but preserve the difference in variance, the sex ratio among people who meet the math requirement is about 2.2.
  • If we set the variances to be the same, but preserve the difference in means, the sex ratio among people who meet the math requirement is about 2.2.
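
Here is a sketch of that decomposition, using the normal survival function rather than simulation. Again, the means and standard deviations are placeholders, not the values fitted to the NLSY97 data, so the printed ratios are illustrative only.

from scipy.stats import norm

# Placeholder parameters for the math score -- not the fitted values.
mu_m, sigma_m = 530, 120     # hypothetical male mean and standard deviation
mu_f, sigma_f = 510, 110     # hypothetical female mean and standard deviation
thresh = 800

def sex_ratio(mu1, sigma1, mu2, sigma2):
    """Ratio of the fractions of each group scoring above thresh."""
    return norm.sf(thresh, mu1, sigma1) / norm.sf(thresh, mu2, sigma2)

print(sex_ratio(mu_m, sigma_m, mu_f, sigma_f))   # both differences
print(sex_ratio(mu_f, sigma_m, mu_f, sigma_f))   # same means, different variances
print(sex_ratio(mu_m, sigma_f, mu_f, sigma_f))   # different means, same variances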

In summary:

  • In a simple model, the effect of factor (c) is modest; in reality, it is likely to be smaller.
  • In the same model, the effect of factor (b) is substantial, and can be decomposed into roughly equal contributions from differences in means and variances.

The details of this analysis are in this Jupyter notebook, which you can run on Colab.

COVID-19 and the Inspection Paradox

The inspection paradox (aka length-biased sampling) is one of my favorite topics, and it turns out to be useful in the fight against COVID-19.

During the pandemic, you have probably heard about the effective reproduction number, R, which is the average number of people infected by each infected person. R is important because it determines the large-scale course of the epidemic. As long as R is greater than 1, the number of cases will grow exponentially; if we find ways to drive R below 1, the number of cases will dwindle toward zero.

However, R is an average, and the average is not the whole story. With COVID-19, like many other epidemics, there is a lot of variation around the average.

According to a news feature in Nature, “One study in Hong Kong found that 19% of cases of COVID-19 were responsible for 80% of transmission, and 69% of cases didn’t transmit the virus to anyone.” In other words, most infections are caused by a small number of superspreaders.

This observation suggests a strategy for contact tracing. When an infected patient is discovered, it is common practice to identify people they have been in contact with who might also be infected. “Forward tracing” is intended to find people the patient might have infected; “backward tracing” is intended to find the person who infected the patient.

Now suppose you are a public health officer trying to slow or stop the spread of a communicable disease. Assuming that you have limited resources to trace contacts and test for the disease — and that’s a pretty good assumption — which do you think would be more effective, forward or backward tracing?

The inspection paradox suggests that backward tracing is more likely to discover a superspreader and the people they have infected.

According to the Nature article, “[Backward tracing] is extremely effective for the coronavirus because of its propensity to be passed on in superspreading events […] Any new case is more likely to have emerged from a cluster of infections than from one individual, so there’s value in going backwards to find out who else was linked to that cluster.”

To quantify this effect, let’s suppose that 70% of infected people don’t infect anyone else, as in the Hong Kong study, and the other 30% infect between 1 and 15 other people, uniformly distributed. The average of this distribution is 2.4, which is a plausible value of R.

Now suppose we discover an infected patient, trace forward, and find someone the patient infected. On average, we expect this person to infect 2.4 other people.

But if we trace backward and find the person who infected the patient, we are more likely to find someone who has infected a lot of people, and less likely to find someone who has only infected a few. In fact, the probability that we find any particular spreader is proportional to the number of people they have infected.

By simulating this sampling process, we can compute the distribution we would see by backward tracing. The average of this biased distribution is 10.1, more than four times the average of the unbiased distribution. This result suggests that backward tracing can discover four times more cases than forward tracing, given the same resources.
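
Here is a minimal sketch of that simulation, using the distribution assumed above (the sample size and random seed are arbitrary). Backward tracing is modeled by sampling spreaders with probability proportional to the number of people they infected.

import numpy as np

rng = np.random.default_rng(17)
n = 100_000

# Assumed distribution: 70% of cases infect no one; the rest infect
# between 1 and 15 other people, uniformly.
num_infected = np.where(rng.random(n) < 0.7, 0, rng.integers(1, 16, size=n))
print(num_infected.mean())          # unbiased mean, about 2.4

# Length-biased sampling: each spreader is chosen with probability
# proportional to the number of people they infected.
weights = num_infected / num_infected.sum()
biased_sample = rng.choice(num_infected, size=n, p=weights)
print(biased_sample.mean())         # biased mean, roughly four times larger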

This example is not just theoretical; Japan adopted this strategy in February 2020. As Michael Lewis describes in The Premonition:

“When the Japanese health authorities found a new case, they did not waste their energy asking the infected person for a list of contacts over the previous few days, to determine whom the person might have infected in turn. […] Instead, they asked for a list of people with whom the infected person had interacted with further back in time. Find the person who had infected the newly infected person and you might have found a superspreader. Find a superspreader and you could track down the next superspreader before [they] really got going.”

So the inspection paradox is not always a nuisance; sometimes we can use it to our advantage.

This article is an excerpt from a new book I am working on, Probably Overthinking It: The puzzles and paradoxes of probability.

Bayesian Dice

This article is available in a Jupyter notebook: click here to run it on Colab.

I’ve been enjoying Aubrey Clayton’s new book Bernoulli’s Fallacy. The first chapter, which is about the historical development of competing definitions of probability, is worth the price of admission alone.

One of the examples in Chapter 1 is a simplified version of a problem posed by Thomas Bayes. The original version, which I wrote about here, involves a billiards (pool) table; Clayton’s version uses dice:

Your friend rolls a six-sided die and secretly records the outcome; this number becomes the target T. You then put on a blindfold and roll the same six-sided die over and over. You’re unable to see how it lands, so each time your friend […] tells you only whether the number you just rolled was greater than, equal to, or less than T.

Suppose in one round of the game we had this sequence of outcomes, with G representing a greater roll, L a lesser roll, and E an equal roll:

G, G, L, E, L, L, L, E, G, L

Clayton, Bernoulli’s Fallacy, pg 36.

Based on this data, what is the posterior distribution of T?

Computing likelihoods

There are two parts to my solution: computing the likelihood of the data under each hypothesis, and then using those likelihoods to compute the posterior distribution of T.

To compute the likelihoods, I’ll demonstrate one of my favorite idioms, using a meshgrid to apply an operation, like >, to all pairs of values from two sequences.

In this case, the sequences are

  • hypos: the hypothetical values of T, and
  • outcomes: the possible outcomes each time we roll the die.

hypos = [1, 2, 3, 4, 5, 6]
outcomes = [1, 2, 3, 4, 5, 6]

If we compute a meshgrid of outcomes and hypos, the result is two arrays.

import numpy as np

O, H = np.meshgrid(outcomes, hypos)

The first contains the possible outcomes repeated down the columns.

O
array([[1, 2, 3, 4, 5, 6],
       [1, 2, 3, 4, 5, 6],
       [1, 2, 3, 4, 5, 6],
       [1, 2, 3, 4, 5, 6],
       [1, 2, 3, 4, 5, 6],
       [1, 2, 3, 4, 5, 6]])

The second contains the hypotheses repeated across the rows.

H
array([[1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5, 5],
       [6, 6, 6, 6, 6, 6]])

If we apply an operator like >, the result is a Boolean array.

O > H
array([[False,  True,  True,  True,  True,  True],
       [False, False,  True,  True,  True,  True],
       [False, False, False,  True,  True,  True],
       [False, False, False, False,  True,  True],
       [False, False, False, False, False,  True],
       [False, False, False, False, False, False]])

Now we can use mean with axis=1 to compute the fraction of True values in each row.

(O > H).mean(axis=1)
array([0.83333333, 0.66666667, 0.5       , 0.33333333, 0.16666667,
       0.        ])

The result is the probability that the outcome is greater than T, for each hypothetical value of T. I’ll name this array gt:

gt = (O > H).mean(axis=1)

The first element of the array is 5/6, which indicates that if T is 1, the probability of exceeding it is 5/6. The second element is 2/3, which indicates that if T is 2, the probability of exceeding it is 2/3. And so on.

Now we can compute the corresponding arrays for less than and equal.

lt = (O < H).mean(axis=1)
lt
array([0.        , 0.16666667, 0.33333333, 0.5       , 0.66666667,
       0.83333333])

eq = (O == H).mean(axis=1)
eq
array([0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667,
       0.16666667])

In the next section, we’ll use these arrays to do a Bayesian update.

The Update

In this example, computing the likelihoods was the hard part. The Bayesian update is easy. Since T was chosen by rolling a fair die, the prior distribution for T is uniform. I’ll use a Pandas Series to represent it.

import pandas as pd

pmf = pd.Series(1/6, hypos)
pmf
1    0.166667
2    0.166667
3    0.166667
4    0.166667
5    0.166667
6    0.166667

Now here’s the sequence of data, encoded using the likelihoods we computed in the previous section.

data = [gt, gt, lt, eq, lt, lt, lt, eq, gt, lt]

The following loop updates the prior distribution by multiplying by each of the likelihoods.

for datum in data:
    pmf *= datum

Finally, we normalize the posterior.

pmf /= pmf.sum()
pmf
1    0.000000
2    0.016427
3    0.221766
4    0.498973
5    0.262834
6    0.000000

Here’s what it looks like.

[Figure: the posterior distribution of T]

As an aside, you might have noticed that the values in eq are all the same. So when the value we roll is equal to T, we don’t get any new information about T. We could leave the instances of eq out of the data, and we would get the same answer.
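
Here is a quick check, reusing the arrays defined above: if we drop the two eq entries from the data and redo the update, the normalized posterior is the same.

pmf2 = pd.Series(1/6, hypos)
for datum in [gt, gt, lt, lt, lt, lt, gt, lt]:
    pmf2 *= datum

pmf2 /= pmf2.sum()
print(np.allclose(pmf, pmf2))   # True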