Combining Risks
Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.
Combining Risks¶
Here’s a question from the Reddit statistics forum.
Bit of a weird one but I’m hoping you’re the community to help. I work in children’s residential care and I’m trying to find a way of better matching potential young people together.
The way we calculate individual risk for a child is
risk = likelihood + impact
(R=L+I
), so
L4 + I5 = R9
That works well for individuals but I need to work out a good way of calculating a combined risk to place children [in] the home together. I’m currently using the [arithmetic] average but I don’t feel that it works properly as the average is always lower then the highest risk.
I’ll use a fairly light risk as an example, running away from the home. (We call this MFC missing from care) It’s fairly common that one of the kids will run away from the home at some point or another either out of boredom or frustration. If young person A has a risk of 9 and young person B has a risk of 12 the the average risk of MFC in the home would be 10.5
HOWEVER more often then not having two young people that go MFC will often result in more episodes as they will run off together, so having a lower risk rating doesn’t really make sense. Adding the two together to 21 doesn’t really work either though as the likelihood is the thing that increases not necessarily the impact.
I’m a lot better at chasing after run away kids then I am mathematics so please help 😂.
Here’s one way to think about this question: based on background knowledge and experience, OP has qualitative ideas about what happens when we put children at different risks together, and he is looking for a statistical summary that is consistent with these ideas.
The arithmetic mean probably makes sense as a starting point, but it clashes with the observation that if you put two children together who are high risk, they interact in ways that increase the risk. For example, if we put together children with risks 9 and 12, the average is 10.5, and OP’s domain knowledge says that’s too low — it should be more than 12.
At the other extreme, I’ll guess that putting together two low risk children might be beneficial to both — so the combined risk might be lower than either.
And that implies that there is a neutral point somewhere in the middle, where the combined risk is equal to the individual risks.
To construct a summary statistic like that, I suggest a weighted sum of the arithmetic and geometric means. That might sound strange, but I’ll show that it has the properties we want. And it might not be as strange as it sounds — there’s a reason it might make sense.
I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.
from os.path import basename, exists
def download(url):
filename = basename(url)
if not exists(filename):
from urllib.request import urlretrieve
local, _ = urlretrieve(url, filename)
print("Downloaded " + str(local))
return filename
download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from utils import decorate
# install the empiricaldist library, if necessary
try:
import empiricaldist
except ImportError:
!pip install empiricaldist
Weighted sum of means¶
The following function computes the arithmetic mean of a sequence of values, which is the sum divided by n
.
def amean(xs):
n = len(xs)
return np.sum(xs) / n
The following function computes the geometric mean of a sequence, which is the product raised to the power 1/n
.
def gmean(xs):
n = len(xs)
return np.prod(xs) ** (1/n)
And the following function computes the weighted sum of the arithmetic and geometric means.
The constant k
determines how much weight we give the geometric mean.
def mean_plus_gmean(*xs, k=1):
return amean(xs) + k * (gmean(xs) - 4)
The value 4
determines the neutral point. So if we put together two people with risk 4
, the combined average is 4
.
mean_plus_gmean(4, 4)
4.0
Above the neutral point, there is a penalty if we put together two children with higher risks.
mean_plus_gmean(5, 5)
6.0
In that case, the combined risk is higher than the individual risks. Below the neutral point, there is a bonus if we put together two children with low risks.
mean_plus_gmean(3, 3)
2.0
In that case, the combined risk is less than the individual risks.
If we combine low and high risks, the discrepancy brings the average down a little.
mean_plus_gmean(3, 5)
3.872983346207417
In the example OP presented, where we put together two people with high risk, the penalty is substantial.
mean_plus_gmean(9, 12)
16.892304845413264
If that penalty seems too high, we can adjust the weight, k
, and the neutral point accordingly.
This behavior extends to more than two people. If everyone is neutral, the result is neutral.
mean_plus_gmean(4, 4, 4)
3.9999999999999996
If you add one person with higher risk, there’s a moderate penalty, compared to the arithmetic mean.
mean_plus_gmean(4, 4, 5), amean([4, 4, 5])
(4.6422027133971, 4.333333333333333)
With two higher risk people, the penalty is bigger.
mean_plus_gmean(4, 5, 5), amean([4, 5, 5])
(5.308255500279445, 4.666666666666667)
And with three it is even bigger.
mean_plus_gmean(5, 5, 5), amean([5, 5, 5])
(5.999999999999999, 5.0)
Does this make any sense?¶
The idea behind this suggestion is logistic regression with an interaction term. Let me explain where that comes from. OP explained:
The way we calculate individual risk for a child is
risk = likelihood + impact
(R=L+I
), so
L4 + I5 = R9
At first I thought it was strange to add a likelihood and an impact score, Thinking about expected value, I thought it should be the product of a probability and a cost. But if both are on a log scale, adding these logarithms is like multiplying probability by impact on a linear scale, so that makes more sense.
And if the scores are consistent with something like a log-odds scale, we can see a connection with logistic regression.
If r1
and r2
are risk scores, we can imagine a regression equation that looks like this, where p
is the probability of an outcome like “missing from care”:
logit(p) = a r1 + b r2 + c r1 r2
In this equation, logit(p)
is the combined risk score, a
, b
, and c
are unknown parameters, and the product r1 r2
is an interaction term that captures the tendency of high risks to amplify each other.
With enough data, we could estimate the unknown parameters. Without data, the best we can do is chose values that make the results consistent with expectations.
Since r1
and r2
are interchangeable, they have to have the same parameter.
And since the whole risk scale has an unspecified zero point, we can set it a
and b
to 1/2
.
Which means there is only one parameter left, the weight of the interaction term.
logit(p) = (r_1 + r2) / 2 + k r1 r2
Now we can see that the first term is the arithmetic mean and the second term is close to the geometric mean, but without the square root.
So the function I suggested — the weighted sum of arithmetic and weighted means — is not identical to the logistic model, but it is motivated by it.
With this rationale in mind, we might consider a revision: rather than add the likelihood and impact scores, and then compute the weighted sum of means, it might make more sense to separate likelihood and impact, compute the weighted sum of the means of the likelihoods, and then add back the impact.
Computing by hand¶
In case the Python code makes it hard to see what’s going on, let’s work an example by hand.
Suppose r1
is 9 and r2
is 12.
r1 = 9
r2 = 12
Here’s the arithmetic mean.
m1 = (9 + 12) / 2
m1
10.5
Here’s the geometric mean.
m2 = np.sqrt(9 * 12)
m2
10.392304845413264
And here’s how we combine them.
k = 1
combined_risk = m1 + k * (m2 - 4)
combined_risk
16.892304845413264
Discussion¶
This question got my attention because OP is working on a challenging and important problem — and they provided useful context. It’s an intriguing idea to define something that is intuitively like an average, but is not always bounded between the minimum and maximum of the data.
If we think strictly about generalized means, that’s not possible. But if we think in terms of logarithms, regression, and interaction terms, we find a way.