To predict a job's duration it is useful to know the distribution of lifetimes for all jobs. For sequential jobs on UNIX workstations, two studies have reported distributions of lifetimes with the functional form
$$\Pr\{L > t\} = t^{-k}, \qquad t \ge 1,$$
where k is a parameter that varies from workload to workload [12] [8]. When k is approximately 1, as it often is, the median remaining lifetime of a job is equal to its current age. This property is known as the past-repeats heuristic. In the next section, we will show that the distribution of lifetimes for parallel scientific applications does not fit this model; thus, the past-repeats heuristic does not apply in these environments.
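To see why $k \approx 1$ yields the past-repeats property, here is a short derivation (the symbols $a$, for a job's current age, and $m$, for its median remaining lifetime, are introduced here only for illustration). Conditioning the survival function on a job having already run for time $a$ gives
$$\Pr\{L > t \mid L > a\} = \left(\frac{t}{a}\right)^{-k}.$$
The median remaining lifetime satisfies $\Pr\{L > a+m \mid L > a\} = 1/2$, so
$$\left(\frac{a+m}{a}\right)^{-k} = \frac{1}{2}
\quad\Longrightarrow\quad
m = a\left(2^{1/k} - 1\right),$$
which reduces to $m = a$ when $k = 1$: the job's median remaining lifetime equals its age.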
Several previous studies have reported lifetime distributions for parallel scientific applications, but none discuss the shape of the distribution of lifetimes or use it to develop a workload model. Hotovy et al. [10] [11] describe the workload on the IBM SP2 at the Cornell Theory Center. Steube and Moore [14] and Wan et al. [15] describe the workload and scheduling policy on the Intel Paragon at the San Diego Supercomputer Center; in [14] the distribution of lifetimes for batch jobs appears to fit the uniform-log model proposed here (for jobs less than six hours in duration). Feitelson and Nitzberg [6] describe the workload on the Intel iPSC/860 at NASA Ames. Finally, Windisch et al. [16] compare the workloads from SDSC and NASA Ames.
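For concreteness, the uniform-log model can be read as follows: the logarithm of a job's lifetime is uniformly distributed between two bounds, so the cumulative distribution is linear in $\log t$. The sketch below samples from such a distribution under this reading; the bounds of one minute and six hours are hypothetical, chosen only to echo the SDSC observation above, and are not taken from [14].

    import math
    import random

    def sample_uniform_log_lifetime(t_min=60.0, t_max=6 * 3600.0):
        # Draw a lifetime (in seconds) whose logarithm is uniform on
        # [ln t_min, ln t_max]; the CDF is then linear in log t:
        #   F(t) = (ln t - ln t_min) / (ln t_max - ln t_min)
        # The bounds are illustrative assumptions, not measured values.
        u = random.uniform(math.log(t_min), math.log(t_max))
        return math.exp(u)

    # The median of a uniform-log distribution is the geometric mean of
    # its bounds, sqrt(t_min * t_max) -- about 19 minutes here.
    samples = sorted(sample_uniform_log_lifetime() for _ in range(100000))
    print(samples[len(samples) // 2], math.sqrt(60.0 * 6 * 3600.0))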
Devarakonda and Iyer [3] use information about past executions of a program to predict the lifetimes of UNIX processes, as well as their file I/O and memory use.
Atallah et al. [1] propose choosing, for each job, the cluster size that minimizes its turnaround time (queue time plus run time). Since their target system is a timesharing network of workstations, they must consider contention with other jobs, but they do not face the problem of predicting the time until a cluster becomes available. Like us, they raise the question of how local optimization by greedy users affects overall system performance. We address this question in [5]: we allow each job to choose the cluster size that minimizes its expected turnaround time, and find that this application-centric strategy yields better global performance than many proposed system-centric heuristics.
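The application-centric strategy is simple to state in code. The sketch below is a minimal illustration, not the algorithm from [1] or [5]; it assumes hypothetical predictors predict_queue_time(n) and predict_run_time(n) are available and picks the cluster size that minimizes their sum.

    def best_cluster_size(predict_queue_time, predict_run_time, max_size):
        # Minimize expected turnaround time: expected queue time plus
        # expected run time. Both predictors are hypothetical callables
        # mapping a cluster size n to a predicted time in seconds.
        best_n, best_turnaround = None, float("inf")
        for n in range(1, max_size + 1):
            turnaround = predict_queue_time(n) + predict_run_time(n)
            if turnaround < best_turnaround:
                best_n, best_turnaround = n, turnaround
        return best_n, best_turnaround

    # Toy example: queue time grows with the number of processors
    # requested, while run time shrinks under imperfect speedup.
    print(best_cluster_size(lambda n: 30.0 * n,
                            lambda n: 3600.0 / n ** 0.8,
                            max_size=64))

The loop captures the tradeoff Atallah et al. identify: requesting more processors reduces run time but typically increases the time a job must wait for them.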