We can calculate the median of Q(n') exactly by enumerating all possible outcomes (which jobs complete and which are still running), and calculating the probability that the request will be satisfied before time t. Then we can find the median queue time by setting this probability to 0.5 and solving for t. This approach is not feasible when there are many jobs in the system, but it leads to an approximation that is fast to compute and that turns out to be sufficiently accurate for our intended purposes.
We represent each outcome by a bit vector, b, where for each bit, indicates that the ith job is still running, and indicates that the ith job has completed before time t. Since we assume independence between jobs in the system, the probability of a given outcome is the product of the probabilities of each event (the completion or non-completion of a job), and the probability of these events come from the CLM.
For a given outcome, the number of free processors is the sum of the processors freed by each job that completes:
Thus at time t, the probability that the number of free processors is at least the requested cluster size is the sum of the probabilities of all the outcomes that satisfy the request:
Finally, we find the median value of Q(n') by setting and solving for t.
Of course, the number of possible outcomes (and thus the time for this calculation) increases exponentially with p. Thus it is not a feasible approach when there are many running jobs. But when n' is small, it is often the case that there are several jobs running in the system that will single-handedly satisfy the request when they complete. In this case, the probability that the request will be satisfied by time t is dominated by the probability that any one of these benefactors will complete.
In other words, the chance that the queue time for n' processors will exceed time t is approximately equal to the probability that none of the benefactors will complete before t:
This calculation is only approximately correct, since it ignores the possibility that several small jobs might complete and satisfy the request. Thus, we expect this predictor to be inaccurate when there are many small jobs running in the system, few of which can single-handedly satisfy the request. The next section presents an alternative predictor that we expect to be more accurate in this case.