Monday, January 28, 2008
How Do You Get to Carnegie Hall?
One of the participants in one discussion reported that there is a considerable literature attempting to attribute success in sports to some separately testable ability, or even to correlate success across different sports, to account for Bo Jackson, for example, although the case of Michael Jordan was also mentioned.
At any rate, our correspondent reported that all such statistical studies of sports yielded one primary component that correlated with success: the amount of time spent practicing and playing the sport in question. Everything else just vanished into the noise.
Now the "talent" side of the talent vs practice argument can come up with a quick explanation for the practice effect: self-selection. People who are naturally gifted at a given sport tend to enjoy playing and practicing that sport, so they practice more, etc. The difficulty with that explanation is that it is devilishly difficult to predict who is "talented" at a particular sport a priori, at least insofar as creating a testing matrix that will give a good prediction of later success.
There is a similar argument that has recently been made by the authors of Freakonomics.
Dubner and Levitt cite data that suggests that children who are some months older when they begin to play soccer (football to you Euros), owing to what time of the year the children were born, tend to be more heavily represented in professional teams. The argument here is that those few extra months of growth make the children a bit larger and quicker (on average) when they begin to play, so they have an extra edge, tend to do better from the beginning, and, hey, who doesn't spend more time on something that they're better at?
Notice, however, that we're talking about some pretty slim differences in ability here, equivalent to a few months worth of extra maturity, with the differences diminishing over time. So what you're really getting is a very strong feedback loop, where a little extra preliminary edge translates into huge differences in training and practice over time.
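To make that feedback loop concrete, here is a toy simulation (a sketch in Python, with every number in it invented purely for illustration): kids who start a few months older get a tiny bump in starting skill, the better-looking half of the cohort gets picked for more playing time each year, and practice compounds from there.

    import numpy as np

    rng = np.random.default_rng(1)
    n_kids = 10_000

    # Hypothetical starting edge: a spread of birth months within one cohort,
    # translated into a small bump in starting skill (arbitrary units).
    months_older = rng.uniform(0, 12, n_kids)
    skill = 100 + 0.5 * months_older

    hours = np.zeros(n_kids)
    for year in range(10):
        # Selection: the above-median half gets more playing time and coaching.
        selected = skill > np.median(skill)
        practice = np.where(selected, 300, 100)   # hours this year
        hours += practice
        skill += 0.05 * practice                  # skill grows with practice
        skill += rng.normal(0, 1, n_kids)         # plus a little luck

    oldest = months_older > 9
    youngest = months_older < 3
    print("lifetime practice, oldest quarter:  ", hours[oldest].mean())
    print("lifetime practice, youngest quarter:", hours[youngest].mean())

In this toy version the starting skill gap between the oldest and youngest kids is a few points; the gap in accumulated practice after ten rounds of selection is a couple of thousand hours.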
The Freakonomics source for this material is K. Anders Ericsson, who was probably the source for my original correspondent on Compuserve (thus do loops close on the Internet). Ericsson has been studying the practice effect phenomenon for several decades, and he's quite convinced that there is much less to talent than commonly believed. He does note that there are some sports that have physical constraints (no one has ever had to make the choice between being a Sumo wrestler or a jockey, for example), but other than that, Ericsson seems to believe that it's all about practice, and, more importantly, effective practice.
Again, drawing from sports, records are still being broken in most sports, on a regular basis. At some point, the total time spent on practice fills all available time, so there must be some improvements in how athletes practice in order for improvements to continue. And even if those improvements include such things as the use of anabolic steroids and HGH, there are still more (and less) effective ways of using them. Weight training without steroids is more effective than steroids without weight training, for example.
There's one other phenomenon that is often overlooked in the talent/practice debate, however, and that is the random factor. In sports, that is most obvious in how injuries can derail careers. Certainly some of the improvements in training over the past century involve learning how to train with less risk of being injured, but in the heat of the game, injuries happen. Then, later, great pressure is placed on the athlete to "play through it," and thereby increase the severity of the injury. It takes a really tough-minded coach to demand that a player wait until fully recovered before resuming competition. Most are going to go for the appearance of tough-mindedness; after all, there are always plenty more guys who want to play the game. Besides, it's hard to keep up the practice from the sidelines.
Wednesday, October 31, 2007
Significant
Over on Mark Thoma's Economist's View blog, there were a couple of discussions about a, well, let's call it a "raging debate," albeit one in fairly slow motion. The backstory papers are here:
McCloskey and Ziliak, "The Standard Error of Regressions," Journal of Economic Literature 1996.
Ziliak and McCloskey, "Size Matters: The Standard Error of Regressions in the American Economic Review," Journal of Socio-Economics 2004.
Hoover and Siegler, "Sound and Fury: McCloskey and Significance Testing in Economics," Journal of Economic Methodology, 2008.
McCloskey and Ziliak, "Signifying Nothing: Reply to Hoover and Siegler."
These papers were pulled from an entry on "Significance Testing in Economics" by Andrew Gelman, and there followed two discussions at Economist's View:
"Tests of Statistical Significance in Economics" and later, a response by one of the main players (McCloskey), followed by my arguing with a poster named notsneaky. That led to my essay, "The Authority of Science."
Okay, you are allowed to say, "Yeesh."
So let me boil down some of this. McCloskey published a book in 1985, entitled The Rhetoric of Economics, in which she argued that the term "Statistical Significance" occupied a pernicious position in economics, and some other sciences. The 1996 paper by McCloskey and Ziliak (M&Z) continued this argument, and the 2004 paper documented a quantitative method for illustrating the misuse of statistics that derived from what was, basically, an error in rhetoric: the attachment of the word "significant" to certain sorts of statistical tests. The forthcoming (to be published in 2008; the link is to a draft) paper by Hoover and Siegler (H&S) finally rises to the bait, and presents a no-holds-barred critique of M&Z. Then M&Z reply, etc.
Any of my readers who managed to slog through my criticisms of the use of the word "rent" (see "Playing the Rent" and subsequent essays) in economics (as in "rent-seeking behavior") will understand that I start off on the side of the rhetoricians. When a technical subject uses a word in a special, technical sense that is substantially different from its common language use, there is trouble to be had. "Significant" carries the meaning of "important" or "substantial" around with it, but something that is "statistically significant" is simply something that is statistically different from the "null hypothesis" at some level of probability. Often, that level of probability is arbitrarily set to a value like 95%, or to two standard deviations (two sigma), which covers roughly 95% of a normal distribution two-sided, or about 98% one-sided.
(I'll note here that in statistical sampling, one usually uses something like the t-distribution, which only turns into the normal distribution when the number of samples is infinite, and which adds additional uncertainty to account for the size of the sample. The t-distribution also assumes that the underlying distribution being sampled is normal, which is rarely a good assumption at the levels of reliability that are being demanded, so the assumption train has run off the rails pretty early on).
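For the curious, here is how big that small-sample correction is, in a minimal sketch (Python, with scipy assumed to be available): the t-distribution's 95% cutoff is noticeably wider than the normal's 1.96 sigma for small samples, and only converges for large ones.

    from scipy import stats

    # Two-sided 95% cutoffs: the familiar 1.96 sigma for the normal,
    # versus the wider cutoff the t-distribution demands for small samples.
    print("normal:", round(stats.norm.ppf(0.975), 3))
    for n in (5, 10, 30, 100):
        print(f"t, n = {n:3d}:", round(stats.t.ppf(0.975, df=n - 1), 3))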
But some differences make no difference. Given precise enough measurements, one can certainly establish that one purchased pound of ground beef is actually one and one-thousandth of a pound, but no one who purchased it would feel that they were getting a better deal than if they'd gotten a package that was one-thousandth of a pound light. We just don't care about that small a difference; some of the beef is going to stick to the package.
I saw something written recently that referred to something as "statistically reliable," and on the face of it, that would be a much better phrase than "statistically significant," and I will use it hereafter, except when writing about the misused phrase, which I will put in quotes.
So, okay, "statistically significant" is not necessarily "significant." Furthermore, everyone agrees that this is so. But one disagreement is whether or not everyone acts as if this were so. And that is where M&Z's second criticism comes in: that many economics journals (plus some other sciences) simply reject any paper that does not show results at greater than 95% reliability, i.e. the results must be "statistically significant." M&Z say outright that the level of reliability should adapt to the actual importance of the question at hand.
The flip side of this is that, in presenting their work, authors sometimes use "statistically significant" as if it really meant "significant" or "important," rather than just reliable.
Alternately, one can simply report the reliability statistic, the so-called "p value," which is a measure of how likely the result is to have come about simply because of sampling error. I have, for example, published results with p values of 10%, meaning that there was one chance in 10 of the result being just coincidence. I've seen some other p values that were much lower, and those are usually given in the spirit of "there might be something here worth knowing, so maybe someone should do some further work."
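As a concrete illustration (a sketch only, with made-up data, assuming Python and scipy), here is the sort of calculation that produces such a p value:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    # Invented measurements: a small true effect buried in noise.
    sample = rng.normal(loc=0.3, scale=1.0, size=20)

    # Two-sided one-sample t-test against the null hypothesis "true mean is zero."
    t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
    # Reading it the way I did above: a p of 0.10 would mean a result this far
    # from zero comes up about one time in ten from sampling error alone.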
In fact, this practice of reporting p values directly, or of using error bars at the single-sigma level, is fairly standard in some sciences, like physics, chemistry, geology, and so forth. Engineers usually present things that way as well. On the other hand, the vague use of "significant" that M&Z criticize is common in social sciences other than economics, e.g. psychology and sociology, as well as some of the biological sciences, including, especially, medicine.
It's in medicine where all this begins to get a tad creepy. In one of their papers, M&Z refer to a study (of small doses of aspirin on cardiovascular diseases like heart attack and stroke) as having been cancelled, for ethical reasons, before the results reached "statistical significance." "Ha!" exclaim H&S (I am paraphrasing for dramatic effect). "You didn't read the study, merely a comment on it from elsewhere! In fact, when the study was terminated, the aspirin was found to be beneficial against myocardial infarction (both lethal and non-lethal) at the level of p=0.00001, well past the level of statistical significance! It was only stroke deaths and total mortality that had not reached the level of p=0.05!"
Well, that would surely score points in a high school debate, but let's unpack that argument a bit. M&Z say that the phrase "statistically significant" is used as a filter for results, and what do H&S do? They concentrate on the results that were found to be statistically reliable at a high level. How about the stroke deaths? What was the p value? H&S do not even mention it.
(As an aside, I will note that the very concept of a p value of 0.00001 is pretty ridiculous. Here we have an example of the concept of statistical reliability swamping actual reliability. The probability of any distribution perfectly meeting the underlying statistical assumptions of the t-distribution is indistinguishable from zero, and the likelihood of some other confounding factor intervening at a level of more than once per hundred thousand is nigh onto one).
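To put a rough number on that, here is a quick sketch (again Python with scipy): the cutoff corresponding to p = 0.00001 for a one-sided normal test, and the tail probability beyond that same cutoff if the data actually follow a modestly heavier-tailed t-distribution instead.

    from scipy import stats

    # The one-sided normal cutoff for p = 0.00001 sits out past 4 sigma.
    z = stats.norm.isf(1e-5)
    print("normal cutoff for p = 1e-5:", round(z, 2), "sigma")

    # The same cutoff under mildly heavier-tailed alternatives: the nominal
    # one-in-a-hundred-thousand tail probability inflates substantially,
    # and the inflation grows as the tails get heavier.
    for df in (30, 10, 5):
        print(f"t with {df:2d} degrees of freedom:", stats.t.sf(z, df=df))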
Furthermore, H&S use a little example involving an accidental coincidence of jellybeans seeming to cure migraines to show why one must use "statistical significance." Then, when discussing the aspirin study, they invoke the jellybean example. On the face of it, this looks like they are equating migraines with heart attacks and strokes, again, completely ignoring the context in which samples are taken, in order to focus on the statistics. In many ways, it looks like H&S provide more in the way of confirming examples of M&Z's hypothesis than good arguments against it.
Also consider what H&S are saying about the aspirin study, that there was a period of time when members of the control group were dying, when the statistical reliability of the medication had been demonstrated, but the study had yet to be terminated. Possibly the study did not have an ongoing analysis, and depended upon certain predetermined analysis and decision points. But how would such points be selected? By estimating how long it would take for the numbers to be "statistically significant?"
Some studies definitely do use ongoing statistical analyses. Are there really studies where a medication has been shown to be life-saving, to a statistical reliability of 90%, where patients are still dying while the analysts are waiting for the numbers to exceed 95%? How about cases where medications are found to have lethal side effects, but remain on the market until the evidence exceeds "statistical significance?"
The blood runs a little cold at that, doesn't it?
Sunday, October 28, 2007
The Authority of Science
Submission to Authority
Aggressive Support for Authority
Conventionality
It's easy to go deeper, of course, to fear of the unknown, a desire for stability, and an intolerance for ambiguity, all related, all consistent. Similarly, it's easy to find traits that the three basic traits can lead to. Sara Robinson on Orcinus blog has done some heavy lifting in condensing Dean's arguments to clarify the case, especially in her "Cracks in the Wall" series, that begins here.
We tend to speak disparagingly of authoritarians; the word itself is pejorative, as evidenced by the fact that people whom you or I would call authoritarian will simultaneously point the finger at us.
To some extent, they would have a point, in that Rightful Authority does exist, insofar as there are rules and laws of both man and the natural world, and there are networks of people (lawmakers, courts, and lawyers on the one hand, scientists, engineers, and technicians on the other), whose authority within their own domain matters. Courts can send you to jail, and your plumber can clear the water line.
Some of the examples of rightful authority are trivial, and some are deeply held, supported aggressively by their constituents. If you are visiting a construction site you damn well do what the site manager tells you to do. If you don't, someone could get hurt. And if someone you are with doesn't follow the rules, you help keep him in line.
The fact is that most conventional wisdom is, in fact, wisdom, just as each of us makes a thousand decisions every day, almost every one of them right, and almost every one of the wrong ones gets immediately corrected. We don't fall down the stairs, and we don't turn left when we should have turned right on the way to work.
I doubt that anyone who has read even a small fraction of what I've written would take me as a servant of conventionality, and I score very, very low on tests that measure authoritarian tendencies. But I seldom break rules for the sake of breaking rules (well, okay, sometimes I break a rule just to see what will happen, but come on now, we've got to have a little fun). I do believe in process, and I do believe in rigor.
So I want it understood that my intention here is descriptive more than it is prescriptive. It's part of a fight I've been having (or so I've come to understand) for a long time now. I've come to understand that it's a necessary fight, a perpetual fight, and a fair fight.
When I was starting out in the then bright shiny new field of computer simulation modeling, I quickly came to believe that the models we were using were complicated machines that needed an experienced and talented operator. There were a lot of moving parts, and a lot of things that could screw up. Moreover, you needed to have a feel for what you were doing; the sort of pattern recognition that we call "intuition" mattered.
You could not, as the saying goes, "Just turn the crank."
In fact, the people who were paying the bills wanted exactly that. They wanted something with a crank that could be turned to grind out the right answers. I've come to understand that there is a good reason for wanting that, incidentally. If you just turn the crank, if you don't know exactly what's going on, and what buttons to push to twist the results, then it looks like it's less likely that someone will cheat and deliberately give results to advance a particular cause.
But just because there is a good reason for wanting something doesn't make it ultimately work out that way. If you have several groups doing the work and you pick and choose whichever one accidentally gives you what you are looking for, then you've accomplished essentially the same thing as if you had someone twisting the knobs. If there is an accepted protocol, you use whatever flexibility exists in the protocol to give the results closest to what you want. You "game the system," to use the current phrasing.
And there you have the argument for the most inflexible possible system. It is less likely to be gamed.
Science (and scholarship, and the law, and tradition) exists in a constant tension between protocol and choice, between methodology and judgment. Some people, because of whatever combination of their experience and natural tendencies, wind up being strong adherents to, and defenders of, protocol and methodology. Others emphasize insight and judgment. As a strong proponent of insight and judgment, let me stipulate the importance of protocol and methodology. The methodology and protocols of science represent a tradition of judgment and a condensation of insight that cannot be replaced by a single individual, no matter how smart and clever.
Still, we must use judgment and insight to examine methodology and protocol, on an ongoing basis. Otherwise, the process of science becomes stagnant, at best in an endless loop of endlessly reinventing the wheel, just maybe this year in a different color.
All of this is basically an epilog and an introduction. There is a very interesting debate that has flared up in statistical economics about the phrase "statistically significant": what it means as a descriptor and as a protocol. There is also an element of another phenomenon that I've remarked upon before: the use of a technical term in a sense substantively different from its meaning in ordinary speech, which means that it can have a pernicious effect on thought.
And, of course, I had an argument about it. But I'll write about that in a later essay.
Sunday, June 24, 2007
Central Limit
With similar irony, he spends substantial amounts of print in The Black Swan railing against the human tendency to substitute narrative for data, but, of course, the way he conveys his point is via life stories and anecdotes (some of which are fictional). He also insults the French a lot.
Well, golly, he certainly is "stimulating," by which I mean both wrongheaded and just plain wrong about many subjects that I find interesting. I'm not going to attempt a review of his book, because one simply does not try to swat flies in a barnyard, but I will use Mr. Taleb's book as an excuse to write about a few things that it reminds me to write about. One of them, (and god help you who are reading this) is the Central Limit Theorem (CLT).
Taleb writes quite a bit about statistics and their use and misuse, and I'd be there, dude, were it not for the part about his being severely wrong. Much of the time. Indeed, whenever I'm confronted with someone arguing about the use of statistics, I check to see if they have anything to say about the Central Limit Theorem, because that's where the muckup usually begins. Taleb, it's true, makes almost exactly the opposite error from the one that's usually made, but it turns out that being against people who are wrong is not the perfect path to truth that one might hope it to be.
So here is what he says (note on page 325):
The notion of central limit: very misunderstood; it takes a long time to reach the central limit-so as we do not live in the asymptote, we've got problems. All various random variables (as we started [sic] in Chapter 16 [actually Chapter 15-JK] with a +1 or -1, which is called a Bernoulli draw) under summation (we did sum up the wins of the 40 tosses) become Gaussian. Summation is the key here, since we are considering the results of adding up the 40 steps, which is where the Gaussian, under the first and second central assumptions becomes what is called a "distribution." (A distribution tells you how you are likely to have your outcomes spread out, or distributed). However, they may get there at different speeds. This is called the central limit theorem; if you add random variables coming from these tame jumps, it will lead to the Gaussian.
Ah, where to begin. The example that Taleb refers to is a coin flipping sequence that is actually a "Binomial Distribution" that does indeed converge to a good replica of the Gaussian, but that has little to do with the Central Limit Theorem.
Taleb seems to have learned that the CLT has something to do with addition, and that it says that things tend toward a Gaussian (also called "normal") distribution. From there on it's pretty much (to adapt a phrase from Mel Brooks) "authentic techno-gibberish." It's the sort of thing that someone writes when they are trying to snow you.
Here's the deal. The Central Limit Theorem says that, for almost any distribution of numbers that can be produced (technically, one with a finite mean and variance), if you take a large enough sample of those numbers, and average the samples together, the averages of the samples will form an approximately Gaussian distribution around the true mean of the original distribution.
Here's a quickie example, all pretty pictures generated from a spreadsheet. The original distribution is some 2000 samples of the Rand() function, a random number from 0 to 1:
That's actually a pretty severe distribution, flat from zero to one, then zero outside of that range. That's called "discontinuous."
Now let's take another 2000 samples and average them 2 X 2 and plot out that result:
Obviously we're a lot less squared off here. The true mean of the Rand() function, incidentally, is 0.5, and the mean of our 4000 samples is 0.498, with a standard deviation (SD, or sigma) of 0.2.
Now let's do 4 sample averages:
Now the SD has dropped to 0.14, so the edges of the original distribution are over three sigma away from the mean. That's important because, by definition, it's impossible to get an averaged sample of a value greater than 1 or less than 0. It can't happen. So the smaller the standard deviation, the less likelihood that one of those sharp tails will bite us.
By the time we get to averages of six samples, sigma is down to 0.118, and the distribution will pass almost every test for normality.
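If you'd rather not trust my spreadsheet, here is a minimal sketch in Python (numpy assumed, with its uniform generator standing in for Rand()) that reproduces the same progression of standard deviations:

    import numpy as np

    rng = np.random.default_rng(3)

    def averaged_samples(group_size, n_groups=2000):
        """Average group_size uniform(0,1) draws, n_groups times over."""
        draws = rng.uniform(0.0, 1.0, size=(n_groups, group_size))
        return draws.mean(axis=1)

    for k in (1, 2, 4, 6):
        x = averaged_samples(k)
        print(f"averages of {k}: mean = {x.mean():.3f}, SD = {x.std():.3f}")

    # The SDs land near 0.29, 0.20, 0.14, and 0.12 -- that is, 1/sqrt(12 * k) --
    # and histograms of the larger group sizes look increasingly Gaussian.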
Now maybe Taleb would say, "But that's my point! It passes tests for normality, but it isn't a true Gaussian distribution, so if you try to draw any conclusions from this based on assumptions about its being Gaussian, you'll make serious mistakes out in the tails of the distribution. You'll misjudge the probabilities of improbable events! That's what my book is about!"
Perhaps, says I. But I'm not the guy drawing conclusions about Gaussian tails based on the Central Limit Theorem, because I know that the CLT isn't about the tails of the distribution. It's about the mean, and the finding of it. And it's about how most of the data from summed processes is going to look more than a little like it's normally distributed. And since most processes wind up having a lot of summation of one sort or another going on, you're going to see a lot of pretty-good-approximations-to-normally-distributed data coming from your instruments, or whatever else you use to gather data.
Good scientists and good statisticians know this. And when the tails of the distribution are at issue, then you see all sorts of arguments about whether or not you're "really" dealing with a Gaussian, or a Gamma, or a Weibull, or a log-normal, or any of a dozen other statistical distributions. That's if you're a statistician. If you're a scientist you try to understand the underlying process in order to assess such things as conditional probabilities, correlations, various discontinuities, non-linearities, and, my own personal favorite, the Something We Haven't Thought of Yet.
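For what that argument looks like in practice, here is a sketch (Python with scipy; the data and every parameter are invented for illustration) of the sort of horse race a statistician might run among candidate distributions:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    # Invented "measurements"; in real work these come from the instrument.
    data = rng.lognormal(mean=0.0, sigma=0.6, size=500)

    candidates = [
        ("normal",    stats.norm,        {}),
        ("gamma",     stats.gamma,       {"floc": 0}),  # positive support, pin loc at 0
        ("weibull",   stats.weibull_min, {"floc": 0}),
        ("lognormal", stats.lognorm,     {"floc": 0}),
    ]

    for name, dist, kw in candidates:
        params = dist.fit(data, **kw)                # maximum-likelihood fit
        loglik = np.sum(dist.logpdf(data, *params))  # how well it explains the data
        print(f"{name:9s} log-likelihood = {loglik:9.1f}")

    # The bulk of the data can look acceptable under several of these; the tails
    # are where the choices diverge, and that is where judgment about the
    # underlying process has to come in.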
Which is to say, I don't think we're as ignorant and stupid as Nassim Nicholas Taleb seems to think we are.