The structural cause of file size distributions

by Allen B. Downey


Papers and slides

The paper "The structural cause of file size distributions" is available as a Technical Report in gzipped Postscript (310 KB).

It appeared as a poster at SIGMETRICS '01. The poster, which is 36 inches wide and 42.5 inches high, is available in Postscript and PDF.

A substantially revised version of the paper will appear at MASCOTS '01. That version is available in gzipped Postscript (321 KB).

I presented a talk about this work at Wellesley College. The slides I used are available in gzipped Postscript (1.4 MB) and PDF (1.3 MB).

I also presented a talk about this work at the University of Delaware. Here are the slides in gzipped Postscript (1.5 MB) and PDF (1.3 MB).


Abstract

We propose a user model that explains the shape of the distribution of file sizes in local file systems and in the World Wide Web. We examine evidence from 562 file systems, 38 web clients and 4 web servers, and find that this model is an accurate description of these systems, and a better description than an alternative that has been proposed, the Pareto model. We conclude that the distribution of file sizes is generally roughly lognormal and therefore not long-tailed. We discuss the implications of this conclusion for proposed explanations of self-similarity in the Internet.

Introduction

Numerous studies have reported traffic patterns in the Internet that show characteristics of self-similarity (see [ParkWillinger00] for a survey). These observations have led to a call for an explanatory model of self-similar network traffic [Park00].

The best models to date are based on the assumption that the distribution of transfer times in the network is long-tailed [PaxsonFloyd95] [ParulekarMakowski96] [WillingerTaqquShermanWilson95] [FeldmanGilbertHuangWillinger99]. That assumption rests in turn on the premise that the distribution of file sizes is long-tailed [ParkKimCrovella96] [CrovellaTaqquBestavros99].

We contend that the distribution of file sizes in most systems fits the lognormal distribution. We support this claim with empirical evidence from a variety of systems and also with a model of user behavior that explains why file systems tend to have this structure.

We argue that the proposed model is a better fit for the data than the long-tailed model, and furthermore that our user model is more realistic than the explanations for the alternative. We conclude that the distribution of file sizes is not long-tailed.
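To make the contrast between the two candidate models concrete, here is a minimal sketch that compares their tails. It assumes "long-tailed" is meant in the sense of a power-law (Pareto-like) tail. The parameter values (mu, sigma, x_min, alpha) and the helper names are hypothetical choices for illustration, not values or code from the paper. On a log-log plot of P(X > x), a Pareto tail is a straight line with slope -alpha, while a lognormal tail keeps getting steeper as x grows.

```python
import math


def lognormal_ccdf(x, mu, sigma):
    """P(X > x) for a lognormal whose log has mean mu and std dev sigma."""
    z = (math.log(x) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def pareto_ccdf(x, x_min, alpha):
    """P(X > x) for a Pareto with scale x_min and shape alpha."""
    if x < x_min:
        return 1.0
    return (x_min / x) ** alpha


def loglog_slope(ccdf, x, factor=10.0):
    """Slope of log10 P(X > x) versus log10 x over the interval [x, factor * x]."""
    rise = math.log10(ccdf(x * factor)) - math.log10(ccdf(x))
    return rise / math.log10(factor)


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to illustrate the shapes.
    mu, sigma = 9.0, 2.5        # lognormal: median size is exp(9) bytes
    x_min, alpha = 1000.0, 1.0  # Pareto: sizes of at least 1000 bytes

    print("size (bytes)   lognormal slope   Pareto slope")
    for x in (1e4, 1e5, 1e6, 1e7):
        ln_slope = loglog_slope(lambda s: lognormal_ccdf(s, mu, sigma), x)
        pa_slope = loglog_slope(lambda s: pareto_ccdf(s, x_min, alpha), x)
        print(f"{x:12.0e}   {ln_slope:15.3f}   {pa_slope:12.3f}")
```

The Pareto column stays constant at -alpha, while the lognormal column becomes more negative at each step. Over a limited range of sizes the two curves can look alike, which is one reason the tail behavior has to be examined over as many orders of magnitude as the data allow.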

This result creates a problem for existing explanations of self-similarity in the Internet. We discuss the implications and suggest alternate approaches.