cs358 Lecture Notes
Week 4, Tuesday

Statistical inference and Monte-Carlo Simulation
------------------------------------------------

How can you tell whether an observation is meaningful or random?
This is _the_ central question in statistics.

Examples:
  A coin (or die) is biased -- are there more heads than there are
    supposed to be?
  Kerill's Chert-Obsidian data -- is the composition of blades changing
    over time?
  SEC violations -- every time Bob buys something, it goes up
  Sexual dimorphism -- the men I know are taller than the women I know,
    mostly

Classical statistical technique
-------------------------------

1) assume that you are wrong and that the results are random
   (this assumption is the null hypothesis)
2) build a probabilistic model of the system
3) use analysis to calculate the probability of the data you saw,
   within your probabilistic model
4) if the calculated probability is small, you reject the assumption
   that you are wrong.

Difficulties
------------

FORMULATING THE QUESTION... what were the chances of seeing that?
Well, zero or one, depending on what you mean.

Instead, you have to define some notion of weirdness and ask: what were
the chances of seeing something as weird as that, if there were no
underlying cause (other than chance)?

Additional difficulties...

1) arbitrary threshold on probability
2) probabilistic model may be wrong
3) not all probabilistic models can be analyzed
4) not clear how to phrase your new beliefs based on the data
   (unsatisfying to say "we conclude that the null hypothesis is false")

Alternatives
------------

1) Bayesian statistics: all knowledge about the world is probabilistic.
   Rather than accept or reject hypotheses, you modify your
   distributions according to data you observe.
   Example:
     a) start with an assumption about the way human heights are
        distributed
     b) collect some data about male and female heights
     c) update your "belief" about male heights using the male data,
        and likewise for female heights (the update uses Bayes'
        theorem, hence the name)
     d) now you have a belief about the distribution of heights for
        males and females

2) Monte-Carlo simulation: rather than trying to analyze everything,
   use computers to run many simulations of random systems, and look
   at the distribution of outcomes.

   Advantage: no need to do math! Because you don't have to analyze
   the system, you don't have to make as many simplifying assumptions
   in your probabilistic model.

   Examples:
     coin-toss -- run simulated trials rather than figure out the
       binomial distribution
     Kerill's data -- coin toss model
     human heights -- no need to assume that heights are distributed
       normally; you can use the actual observed distribution
     SEC violations -- need a way to generate realistic random time
       series of prices.

BIG DANGER
----------

Often people get distracted looking at approximations in the
calculation of probability, or focus on choosing just the right
threshold, and forget that their probability model is just a model!

Examples:
  coin-toss: actually, in this case the probability model is pretty
    good
  human heights: sampling errors; racial and geographical factors
  SEC violations: what if your model of market fluctuations is wrong?
    basic models tend to underestimate the prob. of large moves, so
    you are likely to convict Bob wrongly!
  Kerill's data: underlying assumption that the knives that happened
    to wind up buried at a particular location represent the
    proportion of materials in circulation
      what if the dig happens to be the site of a chert knife maker?
      what if more valuable knives are less likely to wind up buried?
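
Sketch: Monte-Carlo version of the coin-toss example

A minimal sketch of the coin-toss case, not a definitive implementation.
It assumes the null model is a fair coin, and it takes "as weird as
that" to mean "at least as many heads as we observed" (a one-sided
choice; other notions of weirdness are possible). The function names
and the numbers (60 heads in 100 tosses, 20,000 trials) are made up
for illustration and do not come from any data in these notes.

```python
import random


def simulate_heads(n_tosses, rng):
    """Toss a simulated fair coin n_tosses times; return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(n_tosses))


def monte_carlo_p_value(observed_heads, n_tosses, n_trials=20_000, seed=0):
    """Estimate P(heads >= observed_heads) under the fair-coin (null) model.

    Runs n_trials simulated experiments and returns the fraction whose
    outcome is at least as "weird" as the observation.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    as_weird = 0
    for _ in range(n_trials):
        if simulate_heads(n_tosses, rng) >= observed_heads:
            as_weird += 1
    return as_weird / n_trials


if __name__ == "__main__":
    # Hypothetical observation: 60 heads out of 100 tosses.
    p = monte_carlo_p_value(observed_heads=60, n_tosses=100)
    print(f"estimated P(at least 60 heads in 100 fair tosses) = {p:.4f}")
```

The estimate is itself random: it gets more precise as n_trials grows,
but more trials cannot fix a wrong null model. If the fair-coin
assumption is off, so is the answer.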