We Have a Cover!
And here it is…
Probably Overthinking It is available for preorder now!
And it is currently the #1 New Release in Probability & Statistics:
So if you are inclined to order a copy for yourself, now would be a fine time!
I’ve been experimenting with programming assisted by Large Language Models (LLMs) like ChatGPT. I am amazed at how good it is, and it seems clear to me that the great majority of programming work will be LLM-assisted, starting now. Here are some of the examples I’ve tried.
For most of the exercises in Think Python, if you paste the question into ChatGPT, you get a correct answer on the first try. For example, here’s one of the exercises in the chapter on dictionaries:
Prompt: If you did Exercise 7, you already have a function named has_duplicates that takes a list as a parameter and returns True if there is any object that appears more than once in the list. Use a dictionary to write a faster, simpler version of has_duplicates.
And here’s the response:
ChatGPT: Here’s one possible implementation of the
has_duplicates
function using a dictionary to keep track of the elements in the list:
def has_duplicates(lst):
    """
    Returns True if there are any duplicates in the given list, False otherwise.
    """
    seen = {}
    for item in lst:
        if item in seen:
            return True
        seen[item] = True
    return False
ChatGPT: The function works by iterating through the list and adding each item to the seen dictionary. If an item is already in the dictionary, it means that we’ve seen it before and therefore there’s a duplicate in the list, so we return True. If we iterate through the entire list without finding any duplicates, we return False.

This implementation is faster and simpler than the previous one because checking for the presence of an item in a dictionary is much faster than iterating through the entire list each time. Also, using a dictionary allows us to keep track of the elements in the list using less code.
One of the later exercises asks what’s wrong with this class definition:
class Kangaroo:
    """A Kangaroo is a marsupial."""

    def __init__(self, name, contents=[]):
        """Initialize the pouch contents.

        name: string
        contents: initial pouch contents.
        """
        self.name = name
        self.pouch_contents = contents
Here’s the response:
ChatGPT: The Kangaroo class has a potential issue in its __init__ method with the contents parameter. Specifically, the default value for the contents parameter is a mutable object (i.e., a list), which can cause unexpected behavior.

When the __init__ method is called without a contents argument, it creates a new instance of the list object and assigns it to the contents parameter. However, since the default value for contents is a mutable object, the same list object is used as the default value for all Kangaroo instances that don’t provide a contents argument.

This means that if you create two Kangaroo instances without providing a contents argument, both instances will share the same list object. This can lead to unexpected behavior if you modify the list of one Kangaroo instance, as the changes will be reflected in the other instance as well.

To fix this issue, you can use a more appropriate default value for the contents parameter, such as None, and then create a new list instance inside the __init__ method if no contents argument is provided, like this:
class Kangaroo:
    def __init__(self, name, contents=None):
        """Initialize the pouch contents.

        name: string
        contents: initial pouch contents.
        """
        self.name = name
        if contents is None:
            self.pouch_contents = []
        else:
            self.pouch_contents = contents
So that’s pretty good.
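To make the problem ChatGPT is describing concrete, here is a short demonstration of the shared-default behavior. This is an illustrative example, not part of the exercise; the names Kanga and Roo are made up.

class Kangaroo:
    """The buggy version, with a mutable default argument."""
    def __init__(self, name, contents=[]):
        self.name = name
        self.pouch_contents = contents

kanga = Kangaroo('Kanga')
roo = Kangaroo('Roo')
kanga.pouch_contents.append('car keys')

print(roo.pouch_contents)   # ['car keys'] -- Roo's pouch shares Kanga's list

With the corrected version above, the same experiment prints an empty list for Roo.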
On Twitter, many people have helpfully pointed out that the exercises in Think Python, and solutions available on the web, are part of ChatGPT’s training corpus, so I am “testing on the training set”.
While that may be true, I think it misses the point several times over.
One of my complaints about data structures classes is that most of the work students do is re-implementation of basic algorithms. Most of them have been available in libraries for a long time, which means there is an opportunity to spend less time on how they work and more time on how we use them to do more interesting things. I hope the prevalence of LLMs will accelerate this process.
Here’s one of the exercises from Data Structures and Information Retrieval in Python (aka DSIRP):
Prompt: Write a function called intersect that takes two LetterSet objects and returns a new LetterSet that represents the intersection of the two sets. In other words, the new LetterSet should contain only elements that appear in both sets. In a comment, identify the order of growth of this function.
I also provided the introduction to the question, which introduced the LetterSet object. Here’s the code ChatGPT generated:
def intersect(set1, set2):
    bits = [a and b for a, b in zip(set1.bits, set2.bits)]
    return LetterSet(bits)
And its explanation:
ChatGPT: The intersect function first creates a new list bits by using a list comprehension that computes the element-wise logical AND of the bits in the two LetterSet objects. The resulting list contains True values only for the elements that appear in both sets, and False values for all other elements.

The intersect function then creates a new LetterSet object using the bits list and returns it.

The order of growth of the intersect function is O(1), because the size of the LetterSet objects is fixed at 26 elements, and the function always performs a constant number of operations (i.e., create a new list using a list comprehension, create a new LetterSet object, and return it).
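The introduction that defines LetterSet isn’t reproduced here, but a minimal version along these lines is enough to make the generated code runnable. The details are a reconstruction consistent with the description above (26 boolean “bits”, one per lowercase letter), not necessarily the book’s implementation.

class LetterSet:
    """Represents a set of lowercase letters as a list of 26 booleans."""

    def __init__(self, bits=None):
        self.bits = bits if bits is not None else [False] * 26

    def add(self, letter):
        self.bits[ord(letter) - ord('a')] = True

    def __contains__(self, letter):
        return self.bits[ord(letter) - ord('a')]

def intersect(set1, set2):
    # ChatGPT's function, repeated here so the sketch is self-contained
    bits = [a and b for a, b in zip(set1.bits, set2.bits)]
    return LetterSet(bits)

s1, s2 = LetterSet(), LetterSet()
for c in 'spam':
    s1.add(c)
for c in 'maps':
    s2.add(c)

both = intersect(s1, s2)
print('a' in both, 'x' in both)   # True False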
Unlike Think Python, DSIRP is not well known and if there are answers to these exercises on the web, they are not easy to find.
But again, I don’t think it matters where the answers are coming from. The impact on the practice of programming is the same: the solution to basic programming problems can be generated quickly, which means programmers will be able to spend more time on higher-value activities.
Let’s see how ChatGPT does with something a little harder:
Prompt: write a python function that takes the parameters of two poisson distributions and computes the probability of superiority
Here’s the code (I’ll skip the explanatory text):
from math import exp, factorial

def poisson_probability_of_superiority(lambda1, lambda2):
    """
    Calculates the probability of superiority between two Poisson distributions.

    Parameters:
    lambda1 (float): the parameter of the first Poisson distribution.
    lambda2 (float): the parameter of the second Poisson distribution.

    Returns:
    float: the probability that a random variable drawn from the first Poisson distribution is greater than a random variable drawn from the second Poisson distribution.
    """
    prob = 0
    for i in range(int(lambda1) + 1):
        for j in range(int(lambda2)):
            prob += (exp(-lambda1) * lambda1**i / factorial(i)) * (exp(-lambda2) * lambda2**j / factorial(j))
    return prob
I asked for a version that uses SciPy and NumPy, and got this:
import numpy as np
from scipy.stats import poisson

def poisson_probability_of_superiority(lambda1, lambda2):
    """
    Calculates the probability of superiority between two Poisson distributions.

    Parameters:
    lambda1 (float): the parameter of the first Poisson distribution.
    lambda2 (float): the parameter of the second Poisson distribution.

    Returns:
    float: the probability that a random variable drawn from the first Poisson distribution is greater than a random variable drawn from the second Poisson distribution.
    """
    k = np.arange(int(lambda1) + 1)
    pmf1 = poisson.pmf(k, lambda1)
    cdf2 = poisson.cdf(k - 1, lambda2)
    prob = np.sum(pmf1 * (1 - cdf2))
    return prob
This is close to being a good solution, but there are still a couple of problems. However, after just a few more iterations, I got a good solution to this problem, and a similar solution that works with exponential distributions. You can see the sequence of prompts and responses in this notebook.
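For comparison, here is a minimal sketch of what a corrected version might look like, under the usual definition that the probability of superiority is P(X1 > X2). The truncation point is a choice made for this sketch; this is a reconstruction, not the solution from the notebook.

import numpy as np
from scipy.stats import poisson

def poisson_probability_of_superiority(lambda1, lambda2):
    """P(X1 > X2) where X1 ~ Poisson(lambda1) and X2 ~ Poisson(lambda2)."""
    # sum far enough into the tails that the remaining probability mass is negligible
    high = int(max(poisson.ppf(0.999999, lambda1), poisson.ppf(0.999999, lambda2))) + 1
    k = np.arange(high + 1)
    pmf1 = poisson.pmf(k, lambda1)        # P(X1 = k)
    cdf2 = poisson.cdf(k - 1, lambda2)    # P(X2 <= k - 1), that is, P(X2 < k)
    return np.sum(pmf1 * cdf2)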
I have no idea where ChatGPT’s solution came from. If you Google for a solution to this problem, the results are not very useful. If anyone reading this can find the source of this code, please let me know.
Looking at the sequence of solutions ChatGPT generated, I got several ideas for solving this problem that I might not have come up with. The solutions I got still need some work, but I could polish them off without much additional effort. And I got ChatGPT to generate a unit test as well:
def test_exponential_probability_of_superiority():
    # Test case with known result
    lambda1 = 0.5
    lambda2 = 1.0
    expected_prob = 0.368
    tol = 0.001
    prob = exponential_probability_of_superiority(lambda1, lambda2)
    assert abs(prob - expected_prob) < tol, f"Test case failed: expected {expected_prob}, but got {prob}"
I would not have confidence in this solution without additional refactoring and testing, but ChatGPT got me off to a very fast start on a non-trivial problem.
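One quick way to validate a generated test is to compare it against a closed form where one exists. For exponential distributions parameterized by rate, P(X1 > X2) = lambda2 / (lambda1 + lambda2); if the parameters are scales (means), the formula becomes scale1 / (scale1 + scale2). The sketch below checks the closed form itself with a Monte Carlo simulation; it is an independent check, not part of the ChatGPT exchange, and comparing its output with the expected value in the generated test is a good way to decide whether the test can be trusted.

import numpy as np

def exact_prob_superiority(rate1, rate2):
    """Closed form for P(X1 > X2) when X1, X2 are exponential with the given rates."""
    return rate2 / (rate1 + rate2)

# Monte Carlo check of the closed form
rng = np.random.default_rng(0)
x1 = rng.exponential(scale=1 / 0.5, size=1_000_000)   # rate 0.5
x2 = rng.exponential(scale=1 / 1.0, size=1_000_000)   # rate 1.0

print(exact_prob_superiority(0.5, 1.0))   # 2/3
print(np.mean(x1 > x2))                   # close to 2/3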
If you find yourself working with an unfamiliar language or libraries, LLMs can accelerate your learning — by a lot. Recently I started learning Elm, so I am not very good at it. And it is not a commonly-used language, so I expect the corpus to be limited. Let’s see how it does on a basic stats question:
Prompt: write an elm function to compute the correlation of two lists of floats
Here’s the response:
correlation : List Float -> List Float -> Maybe Float
correlation xs ys =
    let
        n = List.length xs
        meanX = List.sum xs / Float.fromInt n
        meanY = List.sum ys / Float.fromInt n
        numerator = List.foldl (\(x, y) acc -> acc + ((x - meanX) * (y - meanY))) 0 (List.zip xs ys)
        denominatorX = List.foldl (\x acc -> acc + ((x - meanX) * (x - meanX))) 0 xs
        denominatorY = List.foldl (\y acc -> acc + ((y - meanY) * (y - meanY))) 0 ys
        denominator = sqrt (denominatorX * denominatorY)
    in
    if denominator == 0 then
        Nothing
    else
        Just (numerator / denominator)
That’s pretty good, and the explanation is even better:
ChatGPT: The correlation function takes in two lists of floats xs and ys and computes the Pearson correlation coefficient between them. The result is returned as a Maybe Float, as it is possible for the denominator to be 0 if one of the lists is empty or if all the elements in one of the lists are the same, in which case the correlation is undefined.

The function first computes the mean of each list and then uses these means to calculate the numerator of the correlation coefficient using a fold over the zipped lists. It then calculates the denominators for each list separately, and computes the denominator of the correlation coefficient by taking the product of these two denominators and then taking the square root. Finally, the function checks if the denominator is 0 and returns Nothing in that case, or returns the correlation coefficient as a Just Float.
To someone getting the hang of Elm’s type system, this is very helpful.
I asked it to generate a version using map instead of foldl, and another using existing functions to compute mean and variance. You can see the entire exchange in this notebook.
When I wrote about these examples on Twitter, I got more disagreement than I expected. Lots of people reminded me of the limitations of LLMs for generating code. But again, I think this is missing the point. Even if LLMs only solve simple programming problems, there are a lot of simple programming problems! And I conjecture that most programmers spend most of their time on things that ChatGPT could greatly accelerate — or just solve.
And we’ve only been using them for a few weeks! LLMs will get better, and we will get better at using them. So I stand by my conclusion: The great majority of coding will be LLM-assisted, starting now.
LLMs will also have a huge effect on how we teach and learn programming, but I’ll get to that later.
Most of the skills programmers use now are also the skills they will need to work with LLMs. Breaking a problem down into smaller problems, and designing good interfaces between components, are still essential skills. One difference is that now, for each of those smaller problems, programmers need to decide whether it would be easier and faster to solve it themselves or start a conversation with an LLM.
After deciding whether to start a conversation, the next big question is how to compose the prompt. In particular, it makes a big difference how much information is included as a preamble. Just as we all got better at composing search terms, we’ll get better at composing prompts. (I have a hard time not being polite to ChatGPT. I wonder if that will persist, or we’ll start writing blunt imperatives.)
And a final question is when to stop a conversation and work with the code you have, or ask for further refinements. In my experiments, it felt like I reached a point of diminishing returns, where further refinements were likely to introduce new errors. On the other hand, asking for at least two versions of a function produced useful variations.
At least for now, we cannot assume that code produced by an LLM is correct, which means it needs extensive testing. People who are used to test-driven development (TDD) will have a head start with LLM-assisted programming. Of course, we can use LLMs to generate unit tests as well, but then we have to validate the unit tests, too.
Which brings me to what I think will be the most important skill for LLM-assisted programming: reading code. LLMs can generate code much faster than we can understand it, so the ability to read, understand, and check code will be critical.
The other skill that will become more important is meta-language, that is, the vocabulary we use to talk about programs. In my correlation in Elm example, I asked ChatGPT to “factor out the anonymous function”, and it knew exactly what I meant. In general, it seems to understand the meta-language of programming well, so it will be useful if we can speak it.
Most of the skills programmers need to work with LLMs are the skills they already have, but some of them will become more important, especially problem decomposition, reading code, and speaking the meta-language of programming.
When I started work at Brilliant a couple of weeks ago, I learned that one of my new colleagues, Michelle McSweeney, just published a book called OK, which is all about the word OK.
As we discussed the joys and miseries of publishing, Michelle mentioned that she had found a typo in the book after publication. So naturally I took it as a challenge to find the typo. While I was searching, I enjoyed the book very much. If you are interested in etymology, linguistics, and history, I recommend it!
As it turned out, I found exactly one typo. When I told Michelle, she asked me nervously which page it was on. Page 17. She looked disappointed – that was not the same typo she found.
Now, for people who like Bayesian statistics, this scenario raises some questions, most obviously: given what each of us found, how many typos does the book contain that neither of us has discovered?
As it happens, I used a similar scenario as an example in Think Bayes. So I was able to reuse some code and answer these questions.
You can read my solution here.
You can also click here to run the notebook with the solution on Colab.
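For a sense of the calculation without opening the notebook, here is a rough sketch in the same spirit; it is not the notebook’s code. It assumes, purely for illustration, that each reader independently catches any given typo with probability 0.5, and it does a grid update over hypotheses for the total number of typos, given that we each found one typo and they were different.

import numpy as np
from scipy.stats import multinomial

p = 0.5                      # assumed probability that each reader catches any given typo
Ns = np.arange(2, 51)        # hypotheses for the total number of typos
prior = np.ones(len(Ns))     # uniform prior

# categories: found by both, only by me, only by Michelle, by neither
probs = [p * p, p * (1 - p), (1 - p) * p, (1 - p) * (1 - p)]
likelihood = np.array([
    multinomial.pmf([0, 1, 1, n - 2], n=n, p=probs) for n in Ns
])

posterior = prior * likelihood
posterior /= posterior.sum()

print("posterior mean of total typos:", np.sum(Ns * posterior))
print("expected undiscovered typos:", np.sum(Ns * posterior) - 2)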
It’s been a while since anyone said “killer app” without irony, so let me remind you that a killer app is software “so necessary or desirable that it proves the core value of some larger technology,” quoth Wikipedia. For example, most people didn’t have much use for the internet until the world wide web was populated with useful content and the first generation of browsers made it easy to access.
So what is the Bayesian killer app? That is, for people who don’t know much about Bayesian methods, what’s the application that demonstrates their core value? I have a nomination: Thompson sampling, also known as the Bayesian bandit strategy, which is the foundation of Bayesian A/B testing.
I’ve been writing and teaching about Bayesian methods for a while, and Thompson sampling is the destination that provides the shortest path from Bayes’s Theorem to a practical, useful method that is meaningfully better than the more familiar alternative, hypothesis testing in general and Student’s t test in particular.
So what does that path look like? Well, funny you should ask, because I presented my answer last November as a tutorial at PyData Global 2022, and the video has just been posted:
[Video: Bayesian Decision Analysis tutorial, PyData Global 2022]
And if you like the tutorial, you’ll love the game: here are the instructions for a game I designed that uses dice to implement Thompson sampling.
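If you want a taste of the idea before watching, here is a minimal sketch of Thompson sampling for a two-arm A/B test with Beta-Binomial updates. The conversion rates and the number of trials are made up for illustration; this is not the code from the tutorial.

import numpy as np

rng = np.random.default_rng(42)
true_rates = [0.04, 0.05]          # hypothetical conversion rates, unknown to the algorithm
successes = np.zeros(2)
failures = np.zeros(2)

for _ in range(10_000):
    # draw one sample from each arm's Beta posterior and play the arm with the highest draw
    draws = rng.beta(successes + 1, failures + 1)
    arm = np.argmax(draws)
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

print("plays per arm:", successes + failures)
print("posterior means:", (successes + 1) / (successes + failures + 2))

The appeal is that the strategy allocates more trials to the better arm as evidence accumulates, while it keeps exploring the other arm as long as the posteriors overlap.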
Abstract
This tutorial is a hands-on introduction to Bayesian Decision Analysis (BDA), which is a framework for using probability to guide decision-making under uncertainty. I start with Bayes’s Theorem, which is the foundation of Bayesian statistics, and work toward the Bayesian bandit strategy, which is used for A/B testing, medical tests, and related applications. For each step, I provide a Jupyter notebook where you can run Python code and work on exercises. In addition to the bandit strategy, I summarize two other applications of BDA, optimal bidding and deriving a decision rule. Finally, I suggest resources you can use to learn more.
Outline
Prerequisites
For this tutorial, you should be familiar with Python at an intermediate level. We’ll use NumPy, SciPy, and Pandas, but I’ll explain what you need to know as we go. You should be familiar with basic probability, but you don’t need to know anything about Bayesian statistics. I provide Jupyter notebooks that run on Colab, so you don’t have to install anything or prepare ahead of time. But you should be familiar with Jupyter notebooks.
My audience skews left; that is, the people who read my blog are more liberal, on average, than the general population. For example, if I surveyed my readers and asked where they place themselves on a scale from liberal to conservative, the results might look like this:
To be clear, I have not done a survey and this is fake data, but if it were real, we would conclude that my audience is more liberal, on average, than the general population. So in the normal use of the word skew, we might say that this distribution “skews to the left”.
But according to statisticians, that would be wrong, because within the field of statistics, skew has been given a technical meaning that is contrary to its normal use. Here’s how Wikipedia explains the technical definition:
positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of the figure. The distribution is said to be right-skewed, right-tailed, or skewed to the right, despite the fact that the curve itself appears to be skewed or leaning to the left; right instead refers to the right tail being drawn out and, often, the mean being skewed to the right of a typical center of the data. A right-skewed distribution usually appears as a left-leaning curve.
https://en.wikipedia.org/wiki/Skewness
By this definition, we would say that the distribution of political alignment in my audience is “skewed to the right”. It is regrettable that the term was defined this way, because it’s very confusing.
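As a quick numerical check of the technical definition, here is a small sketch using a lognormal sample, which has a long right tail. The sample parameters are arbitrary choices for illustration.

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
sample = rng.lognormal(mean=0, sigma=1, size=100_000)   # long right tail

# positive, so "skewed to the right" in the technical sense,
# even though the bulk of the curve leans left
print(skew(sample))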
Recently I ran a Twitter poll to see what people think skew means. Here are the results:
Interpreting these results is almost paradoxical: the first two responses, corresponding to the two opposing definitions, are almost equally common, which proves that the third response, that the term is ambiguous, is correct. If the statistically literate people who follow me on Twitter don’t agree about what skew means, we have to treat it as ambiguous unless specified.
The comments suggest I’m not the only one who thinks the technical definition is contrary to intuition.
If you think the definition of skew is bad, let’s talk about bias. In the context of statistics, bias is “a systematic tendency which causes differences between results and fact”. In particular, sampling bias is bias caused by a non-representative sampling process.
In my imaginary survey, the mean of the sample is less than the actual mean in the population, so we could say that my sample is biased to the left. Which means that the distribution is technically biased to the left and skewed to the right. Which is particularly confusing because in natural use, bias and skew mean the same thing.
So 20th century statisticians took two English words that are (nearly) synonyms, and gave them technical definitions that can be polar opposites. The result is 100 years of confusion.
For early statisticians, it seems like creating confusing vocabulary was a hobby. In addition to bias and skew, here’s a partial list of English words that are either synonyms or closely related, which have been given technical meanings that are opposites or subtly different.
And don’t get me started on “significance”.
If you got this far, it seems like you are part of my audience, so if you want to answer a one-question survey about your political alignment, follow this link. Thank you!
My poll and this article were prompted by this excellent video about the Central Limit Theorem:
Around the 7:52 mark, a distribution that leans left is described as “skewed towards the left”. In statistics jargon, that’s technically incorrect, but in this context I think it’s likely to be understood as intended.
This recent article in the Washington Post reports that “a police department in Maryland is training officers to spot the signs of driving high by watching people toke up in a tent”. The story features Lt. John O’Brien, who is described as a “trained drug recognition expert”. It also quotes a defense attorney who says, “There are real questions about the scientific validity of what they’re doing.”
As it happens, the scientific validity of Drug Recognition Experts is one of the examples in my forthcoming book, Probably Overthinking It. The following is an excerpt from Chapter 9: “Fairness and Fallacy”.
In September 2017 the American Civil Liberties Union (ACLU) filed suit against Cobb County, Georgia on behalf of four drivers who were arrested for driving under the influence of cannabis. All four were evaluated by Officer Tracy Carroll, who had been trained as a “Drug Recognition Expert” (DRE) as part of a program developed by the Los Angeles Police Department in the 1970s.
At the time of their arrest, all four insisted that they had not smoked or ingested any cannabis products, and when their blood was tested, all four results were negative; that is, the blood tests found no evidence of recent cannabis use.
In each case, prosecutors dismissed the charges related to impaired driving. Nevertheless, the arrests were disruptive and costly, and the plaintiffs were left with a permanent and public arrest record.
At issue in the case is the assertion by the ACLU that, “Much of the DRE protocol has never been rigorously and independently validated.”
So I investigated that claim. What I found was a collection of studies that are, across the board, deeply flawed. Every one of them features at least one methodological error so blatant it would be embarrassing at a middle school science fair.
As an example, the lab study most often cited to show that the DRE protocol is valid was conducted at Johns Hopkins University School of Medicine in 1985. It concludes, “Overall, in 98.7% of instances of judged intoxication the subject had received some active drug”. In other words, in the cases where one of the Drug Recognition Experts believed that a subject was under the influence, they were right 98.7% of the time.
That sounds impressive, but there are several problems with this study. The biggest is that the subjects were all “normal, healthy” male volunteers between 18 and 35 years old, who were screened and “trained on the psychomotor tasks and subjective effect questionnaires used in the study”.
By design, the study excluded women, anyone older than 35, and anyone in poor health. Then the screening excluded anyone who had any difficulty passing a sobriety test while they were sober — for example, anyone with shaky hands, poor coordination, or poor balance.
But those are exactly the people most likely to be falsely accused. How can you estimate the number of false positives if you exclude from the study everyone likely to yield a false positive? You can’t.
Another frequently-cited study reports that “When DREs claimed drugs other than alcohol were present, they [the drugs] were almost always detected in the blood (94% of the time)”. Again, that sounds impressive until you look at the methodology.
Subjects in this study had already been arrested because they were suspected of driving while impaired, most often because they had failed a field sobriety test.
Then, while they were in custody, they were evaluated by a DRE, that is, a different officer trained in the drug evaluation procedure. If the DRE thought that the suspect was under the influence of a drug, the suspect was asked to consent to a blood test; otherwise they were released.
Of 219 suspects, 18 were released after a DRE performed a “cursory examination” and concluded that there was no evidence of drug impairment.
The remaining 201 suspects were asked for a blood sample. Of those, 22 refused and 6 provided a urine sample only.
Of the 173 blood samples, 162 were found to contain a drug other than alcohol. That’s about 94%, which is the statistic they reported.
But the base rate in this study is extraordinarily high, because it includes only cases that were suspected by the arresting officer and then confirmed by the DRE. With a few generous assumptions, I estimate that the base rate is 86%; in reality, it was probably higher.
To estimate the base rate, let’s assume that the 18 suspects who were released after the cursory examination were in fact unimpaired, and that the 28 suspects who refused a blood test or provided only a urine sample would have tested positive at the same rate as the suspects who provided blood samples.
Both of these assumptions are generous; that is, they probably overestimate the accuracy of the DREs. Even so, they imply that 188 out of 219 blood tests would have been positive, if they had been tested. That’s a base rate of 86%.
Because the suspects who were released were not tested, there is no way to estimate the sensitivity of the test, but let’s assume it’s 99%, so if a suspect is under the influence of a drug, there is a 99% chance a DRE would detect it. In reality, it is probably lower.
With these generous assumptions, we can use the following table to estimate the predictive value of the DRE protocol.
|              | Suspects | Prob Positive | Cases | Percent |
|--------------|----------|---------------|-------|---------|
| Impaired     | 86       | 0.99          | 85.14 | 93.8    |
| Not impaired | 14       | 0.40          | 5.60  | 6.2     |
With 86% base rate, we expect 86 impaired suspects out of 100, and 14 unimpaired. With 99% sensitivity, we expect the DRE to detect about 85 true positives. And with 60% specificity, we expect the DRE to wrongly accuse 5.6 suspects. Out of 91 positive tests, 85 would be correct; that’s about 94%, as reported in the study.
But this accuracy is only possible because the base rate in the study is so high. Remember that most of the subjects had been arrested because they had failed a field sobriety test. Then they were tested by a DRE, who was effectively offering a second opinion.
But that’s not what happened when Officer Tracy Carroll arrested Katelyn Ebner, Princess Mbamara, Ayokunle Oriyomi, and Brittany Penwell. In each of those cases, the driver was stopped for driving erratically, which is evidence of possible impairment. But when Officer Carroll began his evaluation, that was the only evidence of impairment.
So the relevant base rate is not 86%, as in the study; it is the fraction of erratic drivers who are under the influence of drugs. And there are many other reasons for erratic driving, including distraction, sleepiness, and the influence of alcohol. It’s hard to say which explanation is most common. I’m sure it depends on time and location. But as an example, let’s suppose it is 50%; the following table shows the results with this base rate.
|              | Suspects | Prob Positive | Cases | Percent |
|--------------|----------|---------------|-------|---------|
| Impaired     | 50       | 0.99          | 49.5  | 71.2    |
| Not impaired | 50       | 0.40          | 20.0  | 28.8    |
With 50% base rate, 99% sensitivity, and 60% specificity, the predictive value of the test is only 71%; under these assumptions, almost 30% of the accused would be innocent. In fact, the base rate, sensitivity, and specificity are probably lower, which means that the value of the test is even worse.
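The arithmetic in both tables is simple enough to put in a few lines of code; this is a sketch of the calculation, not anything from the book.

def predictive_value(base_rate, sensitivity, specificity, n=100):
    """Fraction of positive calls that are correct, per n suspects."""
    impaired = base_rate * n
    not_impaired = n - impaired
    true_pos = impaired * sensitivity
    false_pos = not_impaired * (1 - specificity)
    return true_pos / (true_pos + false_pos)

print(predictive_value(0.86, 0.99, 0.60))   # about 0.94, as in the study
print(predictive_value(0.50, 0.99, 0.60))   # about 0.71, as in the second table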
The suit filed by the ACLU was not successful. The court decided that the arrests were valid because the results of the field sobriety tests constituted “probable cause” for an arrest. As a result, the court did not consider the evidence for, or against, the validity of the DRE protocol. The ACLU has appealed the decision.
I’m happy to report that copyediting of Modeling and Simulation in Python is done, and the book is off to the printer! Electronic versions are available now from No Starch Press; print copies will be available in May, but you can pre-order now from No Starch Press, Amazon, and Barnes and Noble.
To celebrate, I just published one of the case studies from the end of Part I, which is about simulating discrete systems. The case study explores a classic question from queueing theory:
Suppose you are designing the checkout area for a new store. There is room for two checkout counters and a waiting area for customers. You can make two lines, one for each counter, or one line that serves both counters.
In theory, you might expect a single line to be better, but it has some practical drawbacks: in order to maintain a single line, you would have to install rope barriers, and customers might be put off by what seems to be a longer line, even if it moves faster.
So you’d like to check whether the single line is really better and by how much.
Simulation can help answer this question. The following figure shows the three scenarios I simulated:
The leftmost diagram shows a single queue (with customers arriving at rate 𝜆) and a single server (with customers completing service at rate 𝜇).
The center diagram shows a single queue with two servers, and the rightmost diagram shows two queues with two servers.
So, which is the best, and by how much? You can read my answer in the online version of the book. Or you can run the Jupyter notebook on Colab.
Here’s what some of the results look like:
This figure shows the time customers are in the system, including wait time and service time, as a function of the arrival rate. The orange line shows the average we expect based on analysis; the blue dots show the result of simulations.
This comparison shows that the simulation and analysis are consistent. It also demonstrates one of the features of simulation: it is easy to quantify not just the average we expect but also the variation around the average.
That capability turns out to be useful for this problem because, as it turns out, the difference between the one-queue and two-queue scenarios is small compared to the variation, which suggests the advantage would be unnoticed in practice.
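As a point of reference for that comparison, here is a minimal single-queue, single-server sketch (not the ModSimPy implementation from the book) that checks the simulated time in system against the analytic result 1/(μ - λ), which holds when λ < μ. The arrival and service rates are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(3)

def simulate_mm1(lam, mu, num_customers=20_000):
    """Single queue, single server: return average time in system per customer."""
    arrivals = np.cumsum(rng.exponential(1 / lam, num_customers))
    services = rng.exponential(1 / mu, num_customers)
    departures = np.empty(num_customers)
    departures[0] = arrivals[0] + services[0]
    for i in range(1, num_customers):
        start = max(arrivals[i], departures[i - 1])   # wait if the server is still busy
        departures[i] = start + services[i]
    return np.mean(departures - arrivals)

lam, mu = 0.8, 1.0
print("simulated time in system:", simulate_mm1(lam, mu))
print("analytic 1/(mu - lam):   ", 1 / (mu - lam))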
I conclude:
The two configurations are equally good as long as both servers are busy; the only time two lines are worse is when one queue is empty and the other contains more than one customer. In real life, if we allow customers to change lanes, that disadvantage can be eliminated.
From a theoretical point of view, one line is better. From a practical point of view, the difference is small and can be mitigated. So the best choice depends on practical considerations.
On the other hand, you can do substantially better with an express line for customers with short service times. But that’s a topic for another case study.
Sadly, today is my last day at DrivenData, so it’s a good time to review one of the projects I’ve been working on: using probabilistic predictions from Zamba to find animals in camera trap videos.
Zamba is one of the longest-running projects at DrivenData. You can read about it in this blog post: Computer vision for wildlife monitoring in a changing climate.
And if you want to know more about my part of it, I wrote this series of articles.
Most recently, I’ve been working on calibrating the predictions from convolutional neural networks (CNNs). I haven’t written about it, but at ODSC East 2023, I’m giving a talk about it:
Don’t Discard That Pangolin, Calibrate Your Deep Learning Classifier
Suppose you are an ecologist studying a rare species like a pangolin. You can use motion-triggered camera traps to collect data about the presence and abundance of species in the wild, but for every video showing a pangolin, you have 100 that show other species, and 100 more that are blank. You might have to watch hours of video to find one pangolin.
Deep learning can help. Project Zamba provides models that classify camera trap videos and identify the species that appear in them. Of course, the results are not perfect, but we can often remove 80% of the videos we don’t want while losing only 10-20% of the videos we want.
But there’s a problem. The output from deep learning classifiers is generally a “confidence score”, not a probability. If a classifier assigns a label with 80% confidence, that doesn’t mean there is an 80% chance it is correct. However, with a modest number of human-generated labels, we can often calibrate the output to produce more accurate probabilities, and make better predictions.
In this talk, I’ll present use cases based on data from Africa, Guam, and New Zealand, and show how we can use deep learning and calibration to save the pangolin… or at least the pangolin videos. This real-world problem shows how users of ML models can tune the results to improve performance on their applications.
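To make “calibrate the output” a little more concrete, here is a toy sketch that maps overconfident confidence scores to probabilities with isotonic regression. The data are entirely made up, and this is not Zamba’s actual pipeline; in practice the labels would come from a held-out set of human-verified videos.

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, 1000)                      # hypothetical classifier confidence scores
labels = (rng.random(1000) < scores**2).astype(int)   # true probability is lower than the score

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(scores, labels)

# a score of 0.8 corresponds to a true probability closer to 0.64
print(calibrator.predict([0.8]))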
The ODSC schedule isn’t posted yet, but I’ll fill in the details later.
Way back in 2013, I wrote this blog post explaining why you should never use a statistical test to check whether a sample came from a Gaussian distribution. I argued that data from the real world never come from a Gaussian distribution, or any other simple mathematical model, so the answer to the question is always no. And there are only two possible outcomes from the test: if you have enough data, you will reject the hypothesis; if you don’t, you probably won’t.

Either way, the result doesn’t tell you anything useful.
In this article, I will explore a particular example and demonstrate this relationship between the sample size and the outcome of the test. And I will conclude, again, that
Choosing a distribution is not a statistical question; it is a modeling decision. No statistical test can tell you whether a particular distribution is a good model for your data.
For the technical details, you can read the extended version of this article or run this notebook on Colab.
I’ll start by generating a sample that is actually from a lognormal distribution, then use the sample mean and standard deviation to make a Gaussian model. Here’s what the empirical distribution of the sample looks like compared to the CDF of the Gaussian distribution.
It looks like the Gaussian distribution is a pretty good model for the data, and probably good enough for most purposes.
According to the Anderson-Darling test, the test statistic is 1.7, which exceeds the critical value, 0.77, so at the 5% significance level, we can reject the hypothesis that this sample came from a Gaussian distribution. That’s the right answer, so it might seem like we’ve done something useful. But we haven’t.
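Here is a sketch of that check with SciPy. The seed, sample size, and lognormal parameters are arbitrary choices, so the exact statistic will not match the 1.7 reported above.

import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(17)
sample = rng.lognormal(mean=0, sigma=0.5, size=200)

result = anderson(sample, dist="norm")
print(result.statistic)            # Anderson-Darling test statistic
print(result.critical_values)      # critical values ...
print(result.significance_level)   # ... at these significance levels (percent)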
The result from the A-D test depends on the sample size. The following figure shows the probability of rejecting the null hypothesis as a function of sample size, using the lognormal distribution from the previous section.
When the sample size is more than 200, the probability of rejection is high. When the sample size is less than 100, the probability of rejection is low. But notice that it doesn’t go all the way to zero, because there is always a 5% chance of a false positive.
The crossover happens at a sample size of about 120; at that sample size, the probability of rejecting the null is close to 50%.
So, again, if you have enough data, you’ll reject the null; otherwise you probably won’t. Either way, you learn nothing about the question you really care about, which is whether the Gaussian model is a good enough model of the data for your purposes.
That’s a modeling decision, and no statistical test can help. In the original article, I suggested some methods that might.
A recent question on Reddit asked about using resampling with logistic regression. The responses suggest two ways to do it, one parametric and one non-parametric. I implemented both of them and then invented a third, which is a hybrid of the two.
You can read the details of the implementation in the extended version of this article.
Or you can click here to run the Jupyter notebook on Colab.
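As an illustration of the non-parametric (case-resampling) flavor, here is a sketch using statsmodels and made-up data; it is not the code from the notebook. The parametric version would instead keep the predictors fixed and resample simulated outcomes from the fitted model.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# made-up data: one predictor, binary outcome
n = 500
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))
y = (rng.random(n) < p).astype(int)
X = sm.add_constant(x)

def fit_slope(X, y):
    return sm.Logit(y, X).fit(disp=False).params[1]

# non-parametric bootstrap: resample rows with replacement and refit
boot = []
for _ in range(1000):
    idx = rng.integers(n, size=n)
    boot.append(fit_slope(X[idx], y[idx]))

print("standard error of the slope:", np.std(boot))
print("90% CI:", np.percentile(boot, [5, 95]))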
Different ways of computing sampling distributions – and the statistics derived from them, like standard errors and confidence intervals – yield different results. None of them are right or wrong; rather, they are based on different modeling assumptions.
In this example, it is easy to implement multiple models and compare the results. If they were substantially different, we would need to think more carefully about the modeling assumptions they are based on and choose the one we think is the best description of the data-generating process.
But in this example, the differences are small enough that they probably don’t matter in practice. So we are free to choose whichever is easiest to implement, or fastest to compute, or convenient in some other way.
It is a common error to presume that the result of an analytic method is uniquely correct, and that results from computational methods like resampling are approximations to it. Analytic methods are often fast to compute, but they are always based on modeling assumptions and often based on approximations, so they are no more correct than computational methods.