Handicapping pub trivia
Introduction
The following question was posted recently on Reddit’s statistics forum:
If there is a quiz of x questions with varying results between teams of different sizes, how could you logically handicap the larger teams to bring some sort of equivalence in performance measure? [Suppose there are] 25 questions and a team of two scores 11/25. A team of 4 scores 17/25. Who did better […]?
One respondent suggested a binomial model, in which every player has the same probability of answering any question correctly.
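Here is a minimal sketch of what that model might look like (the function and parameter names are illustrative, and it assumes the team rule I describe below, where the team gets a question if any player does):

```python
import numpy as np

def simulate_binomial_team(n_players, n_questions, p, rng):
    """Score for one team: a question counts if any player gets it."""
    # Boolean array: entry [i, j] is True if player i gets question j.
    gets = rng.random((n_players, n_questions)) < p
    return gets.any(axis=0).sum()

rng = np.random.default_rng(17)
scores = [simulate_binomial_team(4, 25, 0.25, rng) for _ in range(1000)]
print(np.mean(scores))  # average score for a team of four with p = 0.25
```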
I suggested a model based on item response theory, in which each question has a level of difficulty, d, each player has a level of efficacy, e, and the probability that a player answers a question correctly is expit(e - d + c), where c is a constant offset for all players and questions and expit is the inverse of the logit function.
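In code, the model is a one-liner; here is a sketch using scipy's implementation of expit:

```python
from scipy.special import expit  # inverse of the logit function

def prob_correct(e, d, c=0.0):
    """Probability a player with efficacy e gets a question with difficulty d."""
    return expit(e - d + c)

# A player of average efficacy on a question of average difficulty,
# with c = 0, gets it with probability 0.5.
print(prob_correct(e=0.0, d=0.0))
```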
Another respondent pointed out that group dynamics will come into play. On a given team, it is not enough for one player to know the answer; they also have to persuade their teammates.
I wrote some simulations to explore this question. You can see a static version of my notebook here, or you can run the code on Colab.
In the notebook I implement a binomial model and a model based on item response theory. Interestingly, for the scenario in the question, they yield opposite results: under the binomial model, we would judge that the team of two performed better; under the IRT model, the team of four did.
In both cases I use a simple model of group dynamics: if anyone on the team gets a question, that means the whole team gets the question. So one way to think of this model is that “getting” a question means something like “knowing the answer and successfully convincing your team”.
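Under that all-or-nothing rule, there is also a quick back-of-envelope check of the binomial model: if each of n players gets a question with probability p, the team gets it with probability 1 - (1 - p)^n. Inverting, a team that scores s out of m questions implies p ≈ 1 - (1 - s/m)^(1/n), a rough point estimate that ignores sampling error:

```python
def implied_player_prob(score, n_questions, n_players):
    """Point estimate of the per-player probability under the binomial model."""
    team_prob = score / n_questions
    return 1 - (1 - team_prob) ** (1 / n_players)

print(implied_player_prob(11, 25, 2))  # about 0.252 for the team of two
print(implied_player_prob(17, 25, 4))  # about 0.248 for the team of four
```

By this estimate, the players on the team of two look marginally stronger, which is consistent with the binomial model's verdict above.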
Anyway, I’m not sure I really answered the question, other than to show that the answer depends on the model.