Sunday, June 24, 2007

Central Limit

My sister gave me a book titled The Black Swan (The Impact of the Highly Improbable), by Nassim Nicholas Taleb, whose previous book, Fooled by Randomness, was supposedly a bestseller. Taleb is a former Wall Street trader in derivatives, and he made a lot of money thereby, which, ironically enough, he claims is meaningless, a product of luck and a privileged position (and my readers would know that I agree with him on that one), but which nevertheless is almost certainly why he was given a book deal in the first place.

With similar irony, he spends substantial amounts of print in The Black Swan railing against the human tendency to substitute narrative for data, but, of course, the way he conveys his point is via life stories and anecdotes (some of which are fictional). He also insults the French a lot.

Well, golly, he certainly is "stimulating," by which I mean both wrongheaded and just plain wrong about many subjects that I find interesting. I'm not going to attempt a review of his book, because one simply does not try to swat flies in a barnyard, but I will use Mr. Taleb's book as an excuse to write about a few things that it reminds me to write about. One of them (and god help you who are reading this) is the Central Limit Theorem (CLT).

Taleb writes quite a bit about statistics and their use and misuse, and I'd be there, dude, were it not for the part about his being severely wrong. Much of the time. Indeed, whenever I'm confronted with someone arguing about the use of statistics, I check to see if they have anything to say about the Central Limit Theorem, because that's where the muckup usually begins. Taleb, it's true, makes almost exactly the opposite error that's usually made, but it turns out that being against people who are wrong is not the perfect path to truth that one might hope it to be.

So here is what he says (note on page 325):

The notion of central limit: very misunderstood; it takes a long time to reach the central limit-so as we do not live in the asymptote, we've got problems. All various random variables (as we started [sic] in Chapter 16 [actually Chapter 15-JK] with a +1 or -1, which is called a Bernoulli draw) under summation (we did sum up the wins of the 40 tosses) become Gaussian. Summation is the key here, since we are considering the results of adding up the 40 steps, which is where the Gaussian, under the first and second central assumptions becomes what is called a "distribution." (A distribution tells you how you are likely to have your outcomes spread out, or distributed). However, they may get there at different speeds. This is called the central limit theorem; if you add random variables coming from these tame jumps, it will lead to the Gaussian.


Ah, where to begin. The example that Taleb refers to is a coin flipping sequence that is actually a "Binomial Distribution" that does indeed converge to a good replica of the Gaussian, but that has little to do with the Central Limit Theorem.

Taleb seems to have learned that the CLT has something to do with addition, and that it says that things tend toward a Gaussian (also called "normal") distribution. From there on it's pretty much (to adapt a phrase from Mel Brooks) "authentic techno-gibberish." It's the sort of thing that someone writes when they are trying to snow you.

Here's the deal. The Central Limit Theorem says that, for almost any distribution of numbers that can be produced (you are limited to finite numbers), if you repeatedly draw a large enough sample of those numbers and average each sample, the averages will form an approximately Gaussian distribution around the true mean of the original distribution.

Here's a quickie example, all pretty pictures generated from a spreadsheet. The original distribution is some 2000 samples of the Rand() function, which returns a random number from 0 to 1:


That's actually a pretty severe distribution, flat from zero to one, then zero outside of that range. That's called "discontinuous."

Now let's take another 2000 samples, average them two at a time with the first batch, and plot out that result:


Obviously we're a lot less squared off here. The true mean of the Rand() function, incidentally, is 0.5, and the mean of our 4000 samples is 0.498, with a standard deviation (SD, or sigma) of 0.2.

Now let's do 4 sample averages:


Now the SD has dropped to 0.14, so the edges of the original distribution are over three sigma away from the mean. That's important because, by definition, it's impossible to get an averaged sample of a value greater than 1 or less than 0. It can't happen. So the smaller the standard deviation, the less likelihood that one of those sharp tails will bite us.
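The SD figures quoted here agree with a standard result: averaging n independent draws divides the variance by n. As a sketch, assuming independent draws and taking the variance of a single uniform draw on [0, 1] to be 1/12 (a textbook value, not something derived in this post), with the symbol σ_n introduced here for the SD of an n-sample average:

```latex
\sigma_n = \frac{\sigma_1}{\sqrt{n}} = \frac{1}{\sqrt{12\,n}},
\qquad
\sigma_2 = \frac{1}{\sqrt{24}} \approx 0.204,
\qquad
\sigma_4 = \frac{1}{\sqrt{48}} \approx 0.144,
\qquad
\frac{1 - 0.5}{\sigma_4} \approx 3.46
```

The last ratio is the distance from the mean to either edge of the original distribution, measured in sigmas, for the 4 sample averages.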

By the time we get to averages of six samples:

Sigma is down to 0.118, and the distribution will pass almost every test for normality.
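For readers without a spreadsheet handy, the experiment above can be roughly sketched in Python. This is a stand-in, not the original spreadsheet: Python's random.random() plays the part of Rand(), and the function names, the fixed seed, and the batch size of 2000 averages are all choices made here for illustration.

```python
import random
import statistics


def sample_means(n_per_average, n_averages=2000, seed=0):
    """Return n_averages values, each the mean of n_per_average uniform(0, 1) draws."""
    rng = random.Random(seed)  # fixed seed so a rerun gives the same numbers
    means = []
    for _ in range(n_averages):
        draws = [rng.random() for _ in range(n_per_average)]
        means.append(sum(draws) / n_per_average)
    return means


def summarize(n_per_average, n_averages=2000, seed=0):
    """Return (mean, standard deviation) of the averaged samples."""
    values = sample_means(n_per_average, n_averages, seed)
    return statistics.fmean(values), statistics.pstdev(values)


if __name__ == "__main__":
    # 1 = the raw Rand() values; 2, 4, 6 = the averaged cases discussed in the post
    for n in (1, 2, 4, 6):
        mean, sd = summarize(n)
        print(f"averages of {n}: mean = {mean:.3f}, SD = {sd:.3f}")
```

The exact figures will wobble with the seed, but the SDs should land close to the values quoted in the text, and no averaged value can fall outside the 0 to 1 range.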

Now maybe Taleb would say, "But that's my point! It passes tests for normality, but it isn't a true Gaussian distribution, so if you try to draw any conclusions from this based on assumptions about its being Gaussian, you'll make serious mistakes out in the tails of the distribution. You'll misjudge the probabilities of improbable events! That's what my book is about!"

Perhaps, says I. But I'm not the guy drawing conclusions about Gaussian tails based on the Central Limit Theorem, because I know that the CLT isn't about the tails of the distribution. It's about the mean, and the finding of it. And it's about how most of the data from summed processes is going to look more than a little like it's normally distributed. And since most processes wind up having a lot of summation of one sort or another going on, you're going to see a lot of pretty-good-approximations-to-normally-distributed data coming from your instruments, or whatever else you use to gather data.
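To put rough numbers on how the tails part company with the Gaussian in the six-sample example, here is a small sketch. It rests on assumptions not spelled out in the post: the SD of a six-sample average is taken as sqrt(1/72), about 0.118, and the exact tail uses the simplex-volume formula for a sum of uniforms, which only holds for thresholds within 1/n of the upper edge. The function names are made up for this illustration.

```python
import math


def gaussian_upper_tail(threshold, mean, sd):
    """P(X > threshold) for a Gaussian with the given mean and SD."""
    z = (threshold - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


def exact_upper_tail(threshold, n):
    """P(average of n uniform(0, 1) draws > threshold).

    Uses the closed form (n * (1 - threshold))**n / n!, which is valid
    only when threshold >= 1 - 1/n; anything at or above 1 has probability 0.
    """
    if threshold >= 1.0:
        return 0.0
    if threshold < 1.0 - 1.0 / n:
        raise ValueError("closed form used here needs threshold >= 1 - 1/n")
    return (n * (1.0 - threshold)) ** n / math.factorial(n)


N = 6                             # samples per average, as in the post
MEAN = 0.5                        # true mean of uniform(0, 1)
SD = math.sqrt(1.0 / (12 * N))    # assumed SD of a 6-sample average

if __name__ == "__main__":
    for t in (0.9, 1.0):
        g = gaussian_upper_tail(t, MEAN, SD)
        e = exact_upper_tail(t, N)
        print(f"P(average > {t}): Gaussian approx = {g:.2e}, exact = {e:.2e}")
```

An average above 1 is impossible, yet the Gaussian approximation still assigns it a small nonzero probability; at 0.9 the two answers also disagree. Near the mean the approximation is good, which is the part the CLT is actually about.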

Good scientists and good statisticians know this. And when the tails of the distribution are at issue, then you see all sorts of arguments about whether or not you're "really" dealing with a Gaussian, or a Gamma, or a Weibull, or a log-normal, or any of a dozen other statistical distributions. That's if you're a statistician. If you're a scientist you try to understand the underlying process in order to assess such things as conditional probabilities, correlations, various discontinuities, non-linearities, and, my own personal favorite, the Something We Haven't Thought of Yet.

Which is to say, I don't think we're as ignorant and stupid as Nassim Nicholas Taleb seems to think we are.

2 comments:

Anonymous said...

The CLT is about the distribution of the mean.

Now, reread the conditions on the CLT. The critical one is "sum has finite variance".

The whole point about "extremistan" variables is that they don't.

i.e., CLT doesn't apply. (Or actually, it does sort of apply, resulting in something that sort of looks Gaussian, but also has a fat tail!)

The other underlying assumption with Taleb is he is an investor. He looks at the sum of his gains and losses over the years. And whilst this "almost CLT" drives the probability distributions of his wealth, the fat tail wags that dog.

James Killus said...

Now, reread the conditions on the CLT. The critical one is "sum has finite variance".

That does bring up an interesting point. One of the salient features of a "random walk" is that it is, in fact, an example of a distribution with an unbounded variance; it grows without limit over time. That gives rise to the phenomenon of "spurious correlations" when random walk variables are regressed against other variables.

That's a main reason for the "correlations" that have been found for the stock market against such oddities as sunspots and skirt hems. It also means that the Wall Street "quants," who use spreadsheet multivariate regressions at every opportunity, delude themselves on a regular basis.

I will note that Taleb is at least aware that something is wrong. And recent events have shown that, whatever Taleb's errors in mathematics, science, and statistics, there certainly are traders who are as stupid as he describes.