Computational Modeling Fall 2005 For today, you should have: 1) finished HomeworkFourteen Today's outline: 1) handling empirical distributions 2) Zipf's Law, Pareto distribution, power law... For next time: HomeworkFifteen (to be posted soon) Survey results -------------- 1) most people in the 10-12 range, some fewer (especially during the first few weeks, some longer (especially people with less programming experience) 2) Most people feel comfortable with the material; Python programming is the thing most people are not yet comfortable about. Probstat, too. 3) Pace is good, some people thought slow at the beginning, better now. 4) Introductory reading seems to have achieved its goals: big picture introduction, whetting appetite. 5) Some thoughts: better for me to choose readings, skip reviews more discussion of the readings in class keep going with the readings (!) fewer books, more rigorous books, more accountability for reviews fewer books, dump the duds develop the context/concept pages more read more, faster fewer books, more discussion in class get to the modules with homework sooner more discussion of the books in class dump the bad books more problems with pirates in them (arrr!) 6) 8 in favor of more modules, smaller project 5 noncommittal 3 in favor of bigger project 7) Do you want a project for your portfolio? 8) Like the interconnectedness. Like the wiki, and online resources. Some frustration with the programming. Be more formal about what to turn in and how it will be assessed. More examples of discrete simulations that accomplish something beyond what analytic models can do. Handling empirical distributions -------------------------------- 1) Download http://wb/cm/code/Hist1.py and http://wb/cm/code/lifetimes 2) Read the first few lines of lifetimes. Each line is the duration, in seconds of CPU time, of a process run on a cluster of workstations at Berkeley in 1994. We observed 184 612 processes. Values of 0 were "rounded up" to 0.005, which is 1/2 the precision of the timer. 3) Read Hist1.py so you understand what it does, and run python Hist1.py lifetimes. Plot this data. Does this set of lifetimes obey a power law? Over what range? Over how many decades? What is the parameter? 4) Fill in the body of ccdf, and change the program to print the ccdf instead of the pdf. Plot this data and then try again to answer the questions in step 3. Zipf meets Pareto ----------------- Zipf's law states: f ~ r ^ -b Taking the log of both sides, we get log(f) ~ -b * log(r) So if we plot f versus r on a log-log scale, we see a straight line with slope -b. The inverse of this function expresses rank as a function of frequency: r ~ f ^ (- 1/b) One way to interpret this function is to say that for a given frequency, it tells me the number of words that have the same or higher frequency. If I divide both sides by n, the total number of words in the corpus, I get: r/n ~ f ^ (-1/b) / n Now the left side is a percentile, the fraction of words with frequency >= f. This is the complement of the CDF of the frequencies. Remember that CDF(x) = Prob{X <= x} so the >= and the <= don't quite agree, but let's ignore that. Zipf's law implies that 1 - r/n = CDF(f) = 1 - f ^ (-1/b) / n Which is exactly the same as saying that the distribution of f is Pareto with slope parameter k = 1/b. The usual way to check for a Pareto distibution is to plot the complementary CDF on a log-log scale. y = CDF(x) = 1 - (x / min) ^ -k Taking the log of both sides log(1-y) = -k log(x) + C So, again, we expect to see a line, this time with slope -1/b. Finally, you will sometimes see people plot a histogram of frequencies on a log-log scale. Since the histogram is an empirical estimate of the pdf, we expect to see something like: y = pdf(f) ~ k f ^ -(k+1) where k = 1/b y can be either the number of words with the given frequency or the fraction of words with a given frequency. Either way, we can take the log of both sides: log(y) ~ log(k) - (k+1) log(f) So, again, we get a straight line with slope -(k+1). In summary, there are three graphs people use: 1) frequency versus rank on a log-log scale a straight line supports Zipf's Law. 2) complementary CDF of the frequencies on a log-log scale a straight line indicates a Pareto distribution 3) a histogram or PDF of the frequencies on a log-log scale a straight line indicates a "power low" All three claims are mathematically equivalent. As an empirical test, (1) and (2) are more robust than (3), because the PDF is the derivative of the CDF. Looking at the integral allows us to see the cumulative effect of deviations from the distribution rather than raw, noisy deviations. Thanks to Lada Adamic at HP Labs, whose article "Zipf, power law and Pareto -- a ranking tutorial" helped me get all this straight. You might want to read it at: http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html