Predicting Queue Times

Predicting queue times on space-sharing parallel computers

by Allen B. Downey

This paper appeared at the 11th International Parallel Processing Symposium, Geneva, Switzerland, April 1997 (IPPS '97).

Here is the version that appeared (gzipped postscript).

Here is the HTML version.

I presented this paper in a talk for the Parallel Computing Lab at UCSD. Here are the slides I used (gzipped postscript).

Here is the sparsely-documented simulator I used to generate some of the results in this paper.


Abstract

We present statistical techniques for predicting the queue times experienced by jobs submitted to a space-sharing parallel machine with first-come-first-served (FCFS) scheduling. We apply these techniques to trace data from the Intel Paragon at the San Diego Supercomputer Center and the IBM SP2 at the Cornell Theory Center. We show that it is possible to predict queue times with accuracy that is acceptable for several intended applications. The coefficient of correlation between our predicted queue times and the actual values from the simulated schedules is between 0.65 and 0.72.

Introduction

On space-sharing parallel machines, it is useful to be able to predict how long a submitted job will be queued before processors are allocated to it. Some of the applications of these predictions are:

  • Load metrics: They provide a measure of load that is more concrete than abstractions such as load average, allowing users to make decisions about what jobs to run, where to run them and what size problems they can solve in an allotted time.
  • Internal resource selection: They allow malleable parallel jobs (jobs that are not limited to a specific cluster size, but can run on any number of processors) to choose a cluster size that will minimize their turnaround time (the sum of queue time and run time).
  • External resource selection: They allow distributed jobs to choose among various computing resources on a network, based on the quality of service they expect to receive at each host.
This paper presents a model of the workload on a parallel machine and shows how to use this model to predict queue times. We present observations of the workload on the Intel Paragon at the San Diego Supercomputer Center (SDSC) and the IBM SP2 at the Cornell Theory Center (CTC), and show that they fit the proposed model well. We use these workloads and a trace-driven simulator to evaluate the proposed techniques for predicting queue times. We conclude that our techniques can predict queue times on real machines with accuracy sufficient for the proposed application.

Conclusions

  • We have proposed a workload model for batch jobs on space-sharing parallel computers, based on a uniform-log distribution of lifetimes. The proposed model captures the behavior of real workloads in two different environments. This model can be used to generate synthetic workloads for simulating and evaluating scheduling policies for similar environments.
  • We have used this workload model to develop statistical techniques for predicting queue times for incoming jobs. The proposed techniques worked well in simulations of both of the environments we tested. These techniques should be useful for several applications, especially selection of cluster sizes on space-sharing parallel computers.
  • Many scheduling policies for supercomputing environments depend on information provided by users about the resource requirements of their jobs. We observe that this information is often unreliable, but show that the proposed techniques are able to distill this information in a useful way.