On space-sharing parallel computers, it is useful to be able to predict
how long a submitted job will be queued before processors are
allocated to it. Some of the applications of these predictions are:
Load metrics:
They provide a measure of load that is more
concrete than abstractions such as load average, allowing users to
make decisions about what jobs to run, where to run them or what size
problems they can solve in an allotted time.
Internal resource selection:
They allow malleable parallel jobs
(jobs that do not require a specific number of processors, but can run on
a range of cluster sizes) to choose a cluster size that is appropriate
for the current state of the machine. This type of allocation is also
called adaptive partitioning.
External resource selection:
They allow distributed jobs to
choose among various computing resources on a network, based
on the quality of service they expect to receive at each host.
As part of the DOCT project [13], we plan to implement
the techniques proposed here to support resource selection in
a distributed, heterogeneous environment.
For different applications, we will make our predictions at different
times: for external resource selection, we need to predict the entire
queue time from arrival to beginning of execution; for internal
resource selection we will consider only the time from arrival at the
head of the queue until beginning of execution, which is sometimes
called wait time.
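
The two intervals distinguished above can be written down as a minimal sketch. The record type and its field names (JobTimestamps, arrival, head_of_queue, start) are hypothetical, introduced only for illustration, and the timestamps in the example are made-up values in seconds.

```python
from dataclasses import dataclass


@dataclass
class JobTimestamps:
    """Hypothetical record of three events in a job's life (in seconds)."""
    arrival: float        # job is submitted and enters the queue
    head_of_queue: float  # job reaches the head of the queue
    start: float          # processors are allocated; execution begins


def queue_time(job: JobTimestamps) -> float:
    """Entire time in queue: from arrival to beginning of execution."""
    return job.start - job.arrival


def wait_time(job: JobTimestamps) -> float:
    """Time from arrival at the head of the queue to beginning of execution."""
    return job.start - job.head_of_queue


# Illustrative values only (not measured data).
job = JobTimestamps(arrival=0.0, head_of_queue=120.0, start=300.0)
print(queue_time(job))  # 300.0
print(wait_time(job))   # 180.0
```

Under these definitions the wait time is never longer than the queue time, since a job cannot reach the head of the queue before it arrives.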
The focus of this paper is internal resource selection, and thus we will
be making predictions when jobs arrive at the head of the queue. In
future work we will extend these techniques to cover the entire time
a job spends in the queue, and apply those predictions to external
resource selection.