The best kittens, technology, and video games blog in the world.

Wednesday, December 16, 2009

Misunderstanding statistical significance

Fumer nuit gravement à votre santé ("Smoking seriously harms your health"). by Ségozyme from flickr (CC-NC-ND)

People are natural cargo cultists - they (I say "they" as I'm secretly a cat - just don't tell anyone) have a natural tendency to focus too much on superficial characteristics - if something is good, everything vaguely associated with it must be good too! Or if something is bad, everything vaguely associated with it must be bad. So once people got it into their heads that the Soviet Union and its "socialism" were "bad", everything "socialist" must be "bad" - like single-payer healthcare, financial regulation, and minimum wages. Never mind that even the basic premise is wrong: the performance of Communist economies was fairly decent, and as research says, Soviet growth was actually slightly above the global average in the period 1960-89, at 2.4 percent per year. Eastern Europe's growth was somewhat lower than that of the USA, Western Europe, or Japan, but it beat by huge margins growth in the capitalist countries of Latin America, South Asia, or even the UK for that matter. Not to mention China vs India, or the spectacularly good performance of "socialist" Scandinavia. Or how the pre-Reagan "socialist" USA was much better off economically than after the Reaganite revolution.

Anyway, science. Two practices of science get fetishized to a nauseating degree: peer review and statistical significance. I cringe every time some (let's generously say "misinformed") person says that peer review is somehow central to science, and that something being "peer reviewed" is somehow evidence of quality. The only thing peer review does is enforce the standards of whichever field the research is in. For example, while Dialogue: A Journal of Mormon Thought peer-reviews its articles, I doubt they're worth much more than Youtube comments. And creationists have their own peer-reviewed journals too! And in spite of the widespread use of peer review, there's very little evidence that variations in how it's done affect quality in any way.

Most disciplines are not as bad as Mormon theology or young-Earth creationism, but the standards of a field often have nothing to do with solid science. First, there's the problem of entirely arbitrary standards which simply waste everybody's time, like this and various incompatible citation rules:
I remember when I first started writing on economics, I was scolded for formatting my papers in a two-column, single-spaced format.  While that format was common in computer science, to be taken seriously in economics a paper must be formatted as single-column, double-spaced.

But all academics eventually learn those. A much greater problem than these time-wasters is standards which are actively harmful, like...

Statistical significance

Probably the single biggest problem of science is the appalling quality of statistical analysis. As a funny anecdote, I remember back at university when I had to go through a bunch of database optimization papers - which implemented various performance hacks, and measured how well they did - and more often than not they did it on completely bogus data generated just for the occasion. If you spend even five seconds thinking about it, it's obviously wrong - all non-trivial optimizations depend heavily on the characteristics of data access patterns, and the differences in results span many orders of magnitude. But for one reason or another, using realistic data never got to be part of "good" database optimization research.

For one historical reason or another, almost all of science got affected by an obsession with "statistical significance". Typical research goes as follows:
  • Gather data
  • Throw data at some statistical package
  • Keep tweaking "hypothesis" until you get "statistically significant" answer, and send for publication
  • If no "statistically significant" answers are obtained, usually forget about the entire thing, or maybe send for publication anyway and hope it gets accepted
  • Watch people misinterpret "statistically significant" as "almost certainly true" and "not statistically significant" as "almost certainly false"
But of course, "statistically significant" results are false very often. Assuming perfectly run research, which is never true, 95% statistical significance still lets through, on average, 1 false positive per 20 true null hypotheses tested (independent hypotheses, strictly speaking, but let's skip the complications here). So if you tweaked your hypothesis 50 times in your statistical package, your "significant" result is very likely to be wrong. Then according to established scientific practice you can publish a paper claiming that "broccoli reduces risk of cancer in Asian women aged 40-49 (p<0.05)", "global warming causes malaria (p<how dare you question Al Gore)", or somesuch.
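The multiple-comparisons arithmetic is easy to check with a quick simulation - a toy sketch where all 50 tested hypotheses are actually false, and each test independently has a 5% chance of a lucky "significant" result:

```python
import random

random.seed(1)

# Every null hypothesis is true, yet each test still has a 5% chance
# of coming out "significant" at p < 0.05 by pure luck.
def any_false_positive(n_tests):
    return any(random.random() < 0.05 for _ in range(n_tests))

runs = 10_000
rate = sum(any_false_positive(50) for _ in range(runs)) / runs
print(rate)  # about 1 - 0.95**50, i.e. roughly 0.92
```

So after 50 tweaks, a "significant" finding is nearly guaranteed even though nothing real is there.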

Still, that's nothing compared with the uselessness of something being "not statistically significant". Even otherwise smart people routinely misinterpret it as "with high probability not true". If so, let me start a study in which I take the wallets of a randomly selected half of 10 such people, and we'll measure whether there's any statistically significant reduction in wealth from me taking their wallets. Double-blinded and all! And as science will prove no statistically significant reduction relative to the control group, which court would dare to question science and convict me for doing so?
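A back-of-the-envelope sketch of why my wallet study is safe (the wealth numbers are hypothetical; lognormal just makes wealth suitably unequal):

```python
import random
import statistics

random.seed(2)

# 10 people with highly unequal wealth; a $100 wallet is taken
# from a randomly chosen half of them.
wealth = [random.lognormvariate(10, 1) for _ in range(10)]
random.shuffle(wealth)
control = wealth[:5]
robbed = [w - 100 for w in wealth[5:]]

# Rough standard error of the difference between group means
spread = statistics.stdev(control + robbed)
se_diff = spread * (1 / 5 + 1 / 5) ** 0.5
print(100 / se_diff)  # the $100 effect as a fraction of the noise
```

With wealth varying by tens of thousands of dollars between people, a $100 effect has no hope of reaching significance in a sample of 10 - which says nothing about whether the wallets were actually taken.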

While false positives - wrongly rejecting a true null hypothesis - can come from either bad luck or a bad study, false negatives - failing to reject a false null hypothesis - can come from either of those, or from insufficient sample size relative to effect strength. How strong would the effect need to be?

Let's get back to the smoking trial:
A randomised controlled trial of anti-smoking advice in 1445 male smokers, aged 40-59, at high risk of cardiorespiratory disease. After one year reported cigarette consumption in the intervention group (714 men) was one-quarter that of the “normal care” group (731 men); over 10 years the net reported reduction averaged 53%. The intervention group experienced less nasal obstruction, cough, dyspnoea, and loss of ventilatory function.

During the next 20 years there were 620 deaths (231 from coronary heart disease), 96 cases of lung cancer, and 159 other cancers. Comparing the intervention with the normal care group, total mortality was 7% lower, fatal coronary heart disease was 13% lower, and lung cancer (deaths+registrations) was 11% lower.

Which is not statistically significant. And some libertarian will surely use it to "prove" that smoking is perfectly fine, and that it's all a government conspiracy to mess with people's lives.

But what were the chances of getting statistically significant results? For clarity let's skip all statistical complications and do it the most brutal possible way, and even make both groups the same size. Control group size 720, intervention group size 720, true chance of death in control group 45%, true chance in intervention group 42% (a 7% reduction in mortality) - we just don't know it yet. Statistical significance level: 95%. So the control group had a 95% chance of getting between 298 and 350 deaths (I'll skip the issue of one-sided versus two-sided tests of significance, as the entire idea is highly offensive to Bayesians). Chance of the intervention group having fewer than 298 deaths - merely 38%. So even assuming, entirely implausibly, that this 1440-person-strong 20-year study was perfectly run, there would be a 62% chance that the results would be worthless, because 1440 people is very far from enough. Except maybe as fodder for a meta-analysis.
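This power calculation can be reproduced in a few lines, using a normal approximation to the binomial death counts (and skipping continuity corrections, which is why the printed probability comes out around 0.37 - rounding and approximation choices account for the small difference from 38%):

```python
from math import sqrt
from statistics import NormalDist

n = 720
p_control, p_treat = 0.45, 0.42  # 42% = a 7% relative reduction

# Normal approximations to each group's total death count
control = NormalDist(n * p_control, sqrt(n * p_control * (1 - p_control)))
treat = NormalDist(n * p_treat, sqrt(n * p_treat * (1 - p_treat)))

# Lower end of the control group's central 95% range of death counts
threshold = control.inv_cdf(0.025)
# Chance the intervention group lands below it, i.e. the study's power
power = treat.cdf(threshold)
print(round(threshold), round(power, 2))
```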

By the way, the reduction is merely 7% because what was studied was "trying to convince people to stop smoking". Most people wouldn't be convinced, or would relapse; and many in the control group would stop smoking on their own.

Anyway, how many people would we need for a study with 45% and 42% death-rate groups to have less than a 5% chance of both a false positive (conditional on the null hypothesis being true) and a false negative (conditional on the null hypothesis being false)? 3350 in each group, or 6700 altogether. And that was smoking - the biggest common health risk we know. How many people would we need if we studied normal-risk people - say 10% death rates over the study period - at levels of relative risk typical for diet/lifestyle interventions - say 1%? Two times 1.2M, or 2.4M people. More than the entire population of Latvia. And that's before you consider how highly non-random dropouts from such studies are, and how they will swamp any results we would get. And forget about double-blinding here.
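Sample sizes like these come from the standard two-proportion formula, sketched below. A caveat on assumptions: the one-sided 5% test and 80% power used here are common textbook conventions (assumed, not stated in the text) that land close to the figures above; demanding a 5% false-negative rate as well pushes the per-group numbers up towards 6000-7000 for the smoking case.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size to distinguish death rates p1 and p2
    with a one-sided z-test at significance alpha and given power."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # ~1.645 for a one-sided 5% test
    z_beta = z.inv_cdf(power)       # ~0.842 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

print(n_per_group(0.45, 0.42))   # smoking-scale effect: ~3400 per group
print(n_per_group(0.10, 0.099))  # lifestyle-scale effect: ~1.1M per group
```

Note how the required size blows up as the square of the shrinking effect: cutting the detectable difference from 3 percentage points to 0.1 multiplies the sample by roughly 900.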

In all likelihood, we will never get any direct data about the effects of these kinds of interventions. Science will have no idea whether fish oil cures cancer, whether vegetables lower the risk of obesity, or whether organic food is better for you than non-organic.

Hammy the Hamster tries organic food
The only thing we will have is lower-quality indirect studies with highly ambiguous interpretations, which we might actually have to accept. That, and misunderstandings of statistical significance.


Xianhang Zhang said...

I wrote a blog post a while back about how to fool people using statistical significance: Not statistically significant and other statistical tricks

taw said...

Xianhang Zhang: Thanks, I really like your posts over there.

Divided Mind said...

That 'which is never true' reference seems to be gated or otherwise unavailable.