Accounting data from the Paragon

Accounting data from the Paragon at SDSC, 1995-96

SDSC used to operate a 400-node Intel Paragon that ran a variety of scientific applications. Since researchers studying parallel scheduling often use accounting data from supercomputers to develop workload models, we have collected this data for the past two years and are making it available on the Web.

The files are essentially the output of the jrec command, run with the -j option. To interpret them, you should read the man page for jrec.

The first three columns of these files originally contained the account number and username of the job owner, and the partition ID. In order to protect the privacy of SDSC's users, we have sanitized this data by hashing these three columns. Thus, the numbers in these columns are meaningless except that, for example, two jobs with the same number in the second column were in fact submitted by the same user. Unfortunately, the hashed "names" in the 1995 and 1996 files do not correspond, so a user cannot be followed from one file to the other. I might be able to fix that, though, so let me know if you need it.
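As an illustration of this kind of sanitizing, here is a minimal sketch in Python. It is not the procedure that was actually used on these files; the function names, the keyed-hash approach (HMAC-SHA256), and the eight-digit output are all assumptions chosen for the example. It shows why a hashed column still identifies repeat submitters within one file, and why hashes would only correspond across files if the same key were used for both.

```python
import hashlib
import hmac


def pseudonym(value: str, key: bytes, digits: int = 8) -> int:
    """Map a private string to a number that is stable for a given key.

    Hypothetical helper: the same (value, key) pair always yields the
    same number, so jobs from one user stay grouped together.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % 10**digits


def sanitize(fields, key: bytes, n_private: int = 3):
    """Replace the first n_private fields with pseudonyms; keep the rest."""
    hidden = [str(pseudonym(f, key)) for f in fields[:n_private]]
    return hidden + list(fields[n_private:])


if __name__ == "__main__":
    # Made-up record: account, username, partition, then other fields.
    record = ["acct42", "alice", "part7", "64", "3600"]
    print(sanitize(record, key=b"example-key"))
```

Sanitizing two files with the same key keeps a user's number the same in both; using a different key per file produces numbers that cannot be matched, which is the situation described above.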

This data is not perfect. As you work with it, you will find that there are some jobs that started running before they arrived, or ended before they began, or ran for what seems like a very long time. Also, there are many jobs for which the arrival, start, or end times are not known. So, to use this data, you will have to do some cleaning, and it will generally not be obvious what to do with anomalous entries. All I can suggest is that you try to be conservative (keep as much data as possible), document what you do, and test the impact of your decisions.

For example, in one dataset I found that there were three very long-lived jobs. In fact, they were so long-lived that they were probably a figment of the accounting system's imagination. Unfortunately, discarding them altogether had a significant effect on the behavior of the system I was simulating. I decided that the best thing to do was to set their durations to 24 hours, which is the theoretical time limit on the machine.
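The cleaning described above might be sketched as follows in Python. This is only one possible policy, under stated assumptions: the Job record and its field names are hypothetical (they are not the jrec column layout, which you should take from the man page), times are assumed to be in seconds, and dropping the other anomalous entries is a choice you may well want to make differently. The sketch counts every category so that the impact of each decision can be documented and tested.

```python
from dataclasses import dataclass, replace
from typing import Optional

MAX_RUNTIME = 24 * 3600  # seconds; the 24-hour limit on the machine


@dataclass(frozen=True)
class Job:
    # Hypothetical record; None means the time is not known.
    arrival: Optional[float]
    start: Optional[float]
    end: Optional[float]


def classify(job: Job) -> str:
    """Label a job with the first anomaly found, or "ok"."""
    if job.arrival is None or job.start is None or job.end is None:
        return "missing_time"
    if job.start < job.arrival:
        return "started_before_arrival"
    if job.end < job.start:
        return "ended_before_start"
    if job.end - job.start > MAX_RUNTIME:
        return "too_long"
    return "ok"


def clean(jobs):
    """Return (kept_jobs, counts_by_label).

    Over-long jobs are kept with their duration set to MAX_RUNTIME;
    other anomalous jobs are dropped here (an assumption of this sketch).
    """
    kept, counts = [], {}
    for job in jobs:
        label = classify(job)
        counts[label] = counts.get(label, 0) + 1
        if label == "too_long":
            kept.append(replace(job, end=job.start + MAX_RUNTIME))
        elif label == "ok":
            kept.append(job)
    return kept, counts


if __name__ == "__main__":
    sample = [Job(0, 10, 100), Job(0, 10, 10 + 48 * 3600), Job(None, 5, 9)]
    kept, counts = clean(sample)
    print(len(kept), counts)
```

Running a simulation once with the capped jobs and once with them discarded is a simple way to measure how much the decision matters.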

If you use this data, please send mail to downey@sdsc.edu and tell me what you are up to. Also, please cite this web page.

Here is the data for 1995 (1.4 MB). And here is the corresponding downtime log.

Here is the data for 1996 (0.8 MB). And here is the corresponding downtime log.

Here are pointers to a couple of papers that have used this data:

"A parallel workload model and its implications for processor allocation," Allen B. Downey, click here.

"A Comparison of Workload Traces from Two Production Parallel Machines," Virginia Lo, Kurt Windisch, Yuke Zhuge, Dror Feitelson, and Reagan Moore, click here. This location also serves raw workload traces from other supercomputer sites.

"Impact of Job Mix on Optimizations for Space Sharing Schedulers," Jaspal Subhlok, Thomas Gross and T. Suzuoka, click here.

Here is the new data for the Cray T3E. Send me mail for the description of the contents.