Choosing Cluster Sizes for Space-Sharing Multiprocessors

A parallel workload model and its implications for processor allocation

by Allen B. Downey

This paper appeared at the 6th IEEE International Symposium on High Performance Distributed Computing (HPDC 97). The version that appeared there is available in gzipped postscript and postscript.

I presented this paper at AT&T Research in April 1997. Here are the slides I used (gzipped postscript).

The submitted version of this paper was published as U.C. Berkeley Technical Report CSD-96-922 (gzipped postscript).

The abstract, introduction, and conclusions are below.


Abstract

We develop a workload model based on the observed behavior of parallel computers at the San Diego Supercomputer Center and the Cornell Theory Center. This model gives us insight into the performance of strategies for scheduling malleable jobs on space-sharing parallel computers. We find that Adaptive Static Partitioning (ASP), which has been reported to work well for other workloads, is inferior to some FIFO strategies that adapt better to system load. The best of the strategies we consider is one that explicitly restricts cluster sizes when load is high (a variation of Sevcik's A+ strategy).

Introduction

Space-sharing, distributed-memory multiprocessors, like the Intel Paragon, the Cray T3E and the IBM SP2, are often used in supercomputing environments to support scientific applications. These environments typically have the following characteristics:

  • For batch processing, jobs do not share processors, but rather allocate a cluster of processors exclusively and run to completion. Many of these machines also have an interactive partition that uses timesharing, but this paper only addresses scheduling strategies for batch partitions (pure space-sharing). In the environments we have observed, the vast majority of computation is done in batch mode.

  • Many jobs on these systems are malleable, meaning that they are capable of running on a range of cluster sizes. On the other hand, the programming models used for scientific applications usually do not generate jobs that can change cluster sizes dynamically. Thus, once a job begins execution, its cluster size is fixed.

In current systems, users choose cluster sizes for their jobs by hand, and the system does not have the option of allocating more or fewer than the requested number of processors. The factors that should influence the choice of a cluster size include the characteristics of the application (resource requirements), the load on the system (resource availability) and the performance requirements of the user (deadlines, desired throughput, etc.). But users generally do not have the information, tools, or inclination to weigh all of these factors accurately. Allowing the system to make this decision has the potential to improve system utilization and reduce users' wait times.
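To make this concrete, here is a minimal sketch (in Python) of how a scheduler might weigh these three factors when picking a cluster size. The function name, the scaling rule, and the constants are assumptions for illustration only; they are not the allocation strategies evaluated in this paper.

    # Illustrative sketch of a load-sensitive cluster-size rule.
    # The names, the scaling rule, and the constants are assumptions for
    # illustration; they are not the strategies evaluated in this paper.
    def choose_cluster_size(avg_parallelism, free_processors, jobs_queued,
                            max_cluster=64):
        """Pick a cluster size that shrinks as the queue grows."""
        # Application characteristic: never allocate more processors than
        # the job can use on average.
        cap = min(avg_parallelism, max_cluster)
        # System load: shrink the cap as the number of waiting jobs grows.
        target = max(1, cap // (1 + jobs_queued))
        # Resource availability: never allocate more than what is free.
        return min(target, max(free_processors, 1))

    # Example: average parallelism 32, 48 free processors, 3 jobs waiting.
    print(choose_cluster_size(32, 48, 3))   # prints 8

The point of the sketch is only that the cap comes from an application characteristic (average parallelism) while the scaling comes from system load; the paper examines real strategies of this general form.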

Toward this end, prior studies have proposed allocation strategies that automatically choose how many processors to allocate to each job. Most of these studies evaluate the proposed strategies with analysis and simulation based on hypothetical workloads. Each of the strategies we consider in this paper has been evaluated, under different workload assumptions, in at least three prior studies. One of these compares the performance of these strategies over a wide range of workload parameters, and argues that the discrepancies among the various studies are due to differences in the hypothesized workloads.

The goal of this paper is to focus this debate by constructing a new workload model based on observations of space-sharing systems running scientific workloads. We intend this model to cover the range of workload parameters most relevant to current workloads and hence most applicable to real systems. By narrowing the range of workload parameters, we are able to examine the proposed strategies in more detail and gain insight into the reasons for their success or failure.

Conclusions

  • One of the strategies recommended in other studies (ASP) did not perform well for our workload. We show that this policy is too sensitive to short-term variations in system load. It performs worse than simple FIFO strategies that use application characteristics to bound cluster sizes.

  • The application characteristics we examined, average and variance of parallelism, are useful for choosing the upper bound on cluster size (and thereby imposing a lower bound on efficiency). We found, though, that strategies that considered variance of parallelism were no better than those that considered only average parallelism.

  • The processor working set, or "knee" of the speedup curve, is not an optimal processor allocation.

  • It is not necessary to set maximum cluster sizes precisely. There is a tradeoff between large clusters/long queues and small clusters/short queues, but within a wide range, system performance (measured by average turnaround time) does not vary greatly.

  • Of several strategies with equivalent turnaround times, Sevcik's strategy, which uses a priori knowledge about system load to restrict cluster sizes, yielded the lowest slowdowns. Depending on what metric matters most to users, this strategy might be the best choice.

  • One of Sevcik's underlying assumptions, that cluster sizes should decrease linearly as load increases, is validated by our results. The other underlying assumption, that jobs with a more variable parallelism profile should be allocated fewer processors, is contradicted.

  • Lifetimes for batch jobs on supercomputers are distributed uniformly in log space. Our uniform-log model is useful for summarizing these distributions and generating simulated workloads. The observed distributions have coefficients of variation in the range 2 to 4.
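As a rough illustration of the uniform-log model, one can draw the log of each lifetime uniformly between the logs of a minimum and a maximum lifetime. The bounds and sample size below are placeholders, not parameters fitted to the observed workloads; only the shape of the distribution comes from the model described above.

    # Sketch of the uniform-log lifetime model: log(lifetime) is drawn
    # uniformly between log(t_min) and log(t_max). The bounds here are
    # placeholders, not parameters fitted to the observed workloads.
    import math
    import random
    import statistics

    def sample_lifetimes(n, t_min=1.0, t_max=1e5, seed=0):
        """Draw n job lifetimes distributed uniformly in log space."""
        rng = random.Random(seed)
        lo, hi = math.log(t_min), math.log(t_max)
        return [math.exp(rng.uniform(lo, hi)) for _ in range(n)]

    lifetimes = sample_lifetimes(100_000)
    mean = statistics.fmean(lifetimes)
    cv = statistics.pstdev(lifetimes) / mean    # coefficient of variation
    print(f"mean lifetime: {mean:.1f}, coefficient of variation: {cv:.2f}")

For these placeholder bounds the computed coefficient of variation comes out near the low end of the 2 to 4 range; widening the ratio of t_max to t_min pushes it higher.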