The best kittens, technology, and video games blog in the world.

Showing posts with label science. Show all posts

Tuesday, October 12, 2010

Blogging about Robin Hanson is not about blogging about Robin Hanson

Fancy, snug in my bed by Hairlover from flickr (CC-BY)

This post is a semi-confession, and the difference between "I find X" statements and "X" statements is significant.

When I don't like something, I can of course come up with some reasonable-sounding arguments specific to the particular subject, but usually "X is not really about X, it's just a status-seeking game gone awry" has no trouble getting onto such a list. Sometimes high, sometimes not, but it's rarely missing.

When I like something, I tend to believe "X is not about X" type arguments don't really apply - maybe there's some status seeking involved, but surely it cannot be that significant.

So here's my main point - "X is not about X" is not about X not being about X. It is a cheap and hard-to-refute shot at X to lower its status - essentially the "X is not about X" argument is really about raising the speaker's status relative to X.

Science is not about science


For example, I find it intuitively obvious that academic science is mostly a silly status game, with just a tiny fraction of effort directed towards seeking important truths about reality.

How else can you explain:
  • persistent disregard for statistics, reliance on cargo cult statistics like p-values and PCA
  • vast majority of papers not getting even a single replication attempt
  • most research being about irrelevant minutiae
  • nobody being bothered by research results not being publicly available online
  • review by "peers", not by outsiders
  • obsession with citations
  • obsession with where something gets published
  • routinely shelving results you don't like
This is exactly what we'd expect if it were just one big status-seeking game! Truth seeking would involve:
  • everything being public from even before experiments start
  • all results being published online immediately, most likely on researcher's blog
  • results being routinely reviewed by outsiders with most clue about statistics and methodology at least as much as by insiders with domain knowledge
  • journals wouldn't exist at all
  • citations wouldn't matter much, hyperlinks work just as well - and nobody would bother with silliness like adding classics from the 1980s to the bibliography, as is still commonly done in many fields
  • most research focusing on the most important issues
  • vast majority of effort being put towards reproducing others' research - if it was worth studying, it's worth verifying; and if it's not worth verifying, why did anyone bother with the original research in the first place?
  • serious statistical training as part of curriculum in place of current sham in most disciplines
It's a miracle that anything ever gets discovered at all! It would be a very cheap shot, but it's really easy to finish this line of reasoning by attributing any genuine discovery to an attempt at acquiring funding from outsiders so the status game can continue. And isn't there a huge correlation between commercial funding of science and rate of discovery?

The Not-So-Fat One - playing with water by jeff-o-matic from flickr (CC-NC-ND)

Medicine is definitely about health


And here's something I like - modern medicine. It's just obvious to me that it's about health, and I find claims to the contrary ridiculous.

Yes, plenty of medical interventions have only weak evidence behind them, but this is true of anything in the universe. Hard evidence is an exception (also see the previous section), and among the few things that have hard evidence, medical interventions are unusually well represented.

And yes, a few studies show that medical spending might not be very beneficial on the margin - mostly in short term, mostly in United States, mostly on small samples, always without replication, and I could go on like this.

Anyway, modern medicine has enough defenders without me getting involved, so I'll just stop here. To me it just looks like what you'd get if it was about health, and it doesn't look like what you'd get if it was about status seeking, even if a tiny bit of it gets into it somehow.


But notice how the argument goes. Conclusions first, and you can always fit or refute status seeking in the arguments later. This bothered me until I saw an even bigger point - it all works just as well when you replace "seeking status" with "making money" or any other motivator!

Compare these semi-silly examples:
  • Writing books is not about content, it's about money
  • Writing blog posts is not about content, it's about status
  • Food industry is not really about food, it's about money
  • Organic food is not really about food, it's about status
  • Commercial software is not really about software, it's about money
  • Open Source is not really about software, it's about status
This is just universal, and an argument like that can be applied to nearly everything. But in all of these, money and status are only mechanisms of compensation - people optimize for money or status because that's what people are good at, but the only reason there's a connection between money/status and a given action or product is the underlying connection with a desirable outcome.

To pull an Inception - "X is about status, not X" is about status, not about X being about status, not X.

PS. Here's a paper claiming that 2/3 of highly cited papers had serious problems getting published at all because of peer review. How many breakthroughs didn't get published? How many got published and then ignored? How many people didn't even bother, limiting themselves to standard useless minutiae, or staying away from academic science entirely?

PPS. Here are a few more relevant links:
Notice how except for medicine nobody seems bothered by this.

Wednesday, April 28, 2010

Boobquake experiment is bad science

Mimi Catblue Dynamite by evilfenn from flickr (CC-NC-SA)


In spite of my best attempts I couldn't find any relevant cats, so as a next best thing I included some catgirls possibly participating in the Boobquake experiment


I'm as appalled by the low standards of modern science as the cleric Hojatoleslam Kazem Sedighi is by skimpy female outfits. And yesterday, we both had a chance to be outraged at the same thing - the Boobquake experiment makes a travesty of science!

Background

The conflict comes from a long-accepted paradigm of natural disasters, which explained them as God's punishment for human sins. Research supporting this theory goes as far back as the famous Lot's experiment [Gen 19], and countless other papers of similar antiquity. It was very widely believed until as recently as the mid-18th century, when it was the most common explanation for the 1755 Lisbon earthquake - according to sources of the time, supposedly due to the sluttiness of Portuguese women, and other human sins.

This paradigm has been largely replaced by the current naturalistic paradigm, which claims such events are mostly "random" and have no particular reason - a replacement motivated more by a wish to remove God from the picture than by any hard scientific data, and by crude analogy with cases like creationism vs evolution, in which a similar removal of God was actually supported by a large volume of evidence.

So it should be no surprise that if even the well-researched case of evolution still has many doubters, this is even more true for the less-researched case of the causes of earthquakes and other natural disasters, where the main argument is compatibility with Western secularism.

Given all this, we should be grateful that someone finally tried running an experiment to check the sin theory of natural disasters. Unfortunately the experiment was so far from standards of good science as to render it totally unconvincing.


What went wrong with Boobquake experiment

The boobquake experiment was an attempt to test the sin theory of natural disasters by having a large number of women wear sluttier outfits than usual, make God angry, and cause an increased number of earthquakes.

Before even looking at its scientific value, we should notice that such an experiment grossly violates modern ethical standards by risking the lives of countless people who didn't volunteer for it. In the future it would be prudent to at least limit such experiments to a single geographic location, where all sinners would gather before commencing the sinning - and which non-sinners would be able to evacuate in advance.

Now, for the experiment. No attempt was made to establish a baseline sin level, or even a baseline outfit skimpiness level - so unfortunately we have no way of knowing how large the change due to boobquake was relative to the daily fluctuation of sin. Even the most basic estimates suggest the effect was minor - supposedly about 200,000 women tried to dress sluttier than normal on that day - a number far below the fluctuations caused by daily weather variations in a single country known for sinning, like Sweden. The experiment also took place in the aftermath of the Eyjafjallajökull eruption (believed by some to be caused by the deadly sin of greed of Icelanders) - which caused major disruption of tourism, and it is universally accepted that people sin a lot more during holidays.


Moreover, the sample was not random. Experiment participants were self-selected, and we have reasons to believe that people particularly unconcerned with God's wrath - and therefore with higher than baseline sinfulness levels - constituted the vast majority.

While data on most participants is missing, a comparison between the supposedly "most scandalous" outfit the main researcher wore during the experiment and her profile photo raises serious doubts, as they're about equally revealing.




In addition to the lack of statistical soundness and sample randomness, no attempt was made to blind the experiment. This difficulty is partly inherent, as God's omniscience makes proper double-blinding impossible, but participants knew well in advance that they were assigned to the treatment group, so they may have consciously or subconsciously reduced their other kinds of sinning - a blatant violation of basic rules of scientific testing that would pass no peer review. What's worse, no attempt was even made to establish a control group!

Now it could be argued that blinding participants would be difficult - or that sinning requires conscious action, so both treatment and control groups would be equally in a state of sin by even risking wearing revealing clothing - but this difficulty is present in many experiments, and at least a serious attempt should be made to reduce the placebo effect when it cannot be fully eliminated. Numerous examples of women wearing clothing far more revealing than they thought, easily found on the Web, suggest this kind of blinding is not fully impossible.

Not only was the experiment of dubious quality, the hypothesis tested is far from the standard theory of sin causing natural disasters. The theory has many variations, but even the basic Sodom and Gomorrah version clearly disagrees with the research assumptions:
  • An assumption was made of no delay between sin and punishment, while the theory only says punishment comes some time after the sinning.
  • An assumption was made of near-linearity between sin and punishment, while the theory says God was willing to spare Sodom and Gomorrah if even as few as ten righteous people were found.
Given all these problems, the Boobquake experiment brings essentially no new knowledge to the discipline of causes of natural disasters, and we can only hope that future experiments are better planned, with better design, better experimental controls, and detailed publicly available photographic documentation of sinning.

Catgirls, CascadiaCon, Seattle, WA by djwudi from flickr (CC-NC-SA)


With a baseline like this, it is dubious that Boobquake really changed the average sinning level much

Related research


While Boobquake is clearly bad science, much more relevant research was done for the IPCC reports, which claim that we can expect a high correlation between global warming and extreme weather effects.

This strongly confirms the sin theory of natural disasters, as warmer weather causes many women to wear more revealing outfits, the results including "increased droughts, tropical cyclone activity, and tsunamis" (but curiously no mention of earthquakes).


Now correlation doesn't equal causation, and the IPCC research doesn't directly test the sin theory, but it arguably provides stronger evidence than the Boobquake experiment.

In the future, one would hope solid systematic research is done on variations of female outfit skimpiness - an area that has long been ignored by mainstream science.

Wednesday, December 16, 2009

Misunderstanding statistical significance

Fumer nuit gravement à votre santé. by Ségozyme from flickr (CC-NC-ND)

People are natural cargo cultists - they (I'm saying "they" as I'm secretly a cat - just don't tell anyone) have a natural tendency to focus too much on superficial characteristics - if something is good, everything vaguely associated with it must be good too! Or if something is bad, everything vaguely associated with it must be bad. So once people got it into their heads that the Soviet Union and its "socialism" were "bad", everything "socialist" must be "bad" - like single-payer healthcare, financial system regulations, and minimum wages. Never mind that even the basic premise is wrong, as the performance of Communist economies was fairly decent, and as research says, Soviet growth performance was actually slightly above the global average in the period 1960-89, at 2.4 percent per year. Eastern Europe's growth was somewhat lower than that of the USA, Western Europe, or Japan, but it beat by huge margins growth in the capitalist countries of Latin America, South Asia, or even the UK for that matter. Not to mention China vs India, or the spectacularly good performance of "socialist" Scandinavia. Or how the pre-Reagan socialist USA was much better off economically than after the Reaganite revolution.


Anyway, science. Two practices of science get fetishized to a nauseating degree: peer review and statistical significance. I cringe every time some (let's generously say "misinformed") person says that peer review is somehow central to science, and that something being "peer reviewed" is somehow evidence of quality. The only thing peer review does is enforce the standards of whichever field the research is in. For example, while Dialogue: A Journal of Mormon Thought peer-reviews its articles, I doubt they're worth much more than Youtube comments. And creationists have their own peer-reviewed journals too! And in spite of the widespread use of peer review, there's very little evidence that its variations affect quality in any way.

Most disciplines are not as bad as Mormon theology or young-Earth creationism, but standards of the field have nothing to do with solid science. First, there's a problem of entirely arbitrary standards which simply waste everybody's time like this and various incompatible citation rules:
I remember when I first started writing on economics, I was scolded for formatting my papers in a two-column, single-spaced format.  While that format was common in computer science, to be taken seriously in economics a paper must be formatted as single-column, double-spaced.

But all academics eventually learn those. A much greater problem than these time-wasters are standards which are actively harmful, like...

Statistical significance

Probably the single biggest problem of science is the appalling quality of statistical analysis. As a funny anecdote, I remember back at university when I had to go through a bunch of database optimization papers - which implemented various performance hacks and measured how well they did - and more often than not they did it on completely bogus data generated just for the occasion. If you spend even five seconds thinking about it, it's obviously wrong - all non-trivial optimizations depend highly on the characteristics of data access patterns, and the differences in results span many orders of magnitude. But for one reason or another, using realistic data never got to be part of "good" database optimization research.

For one historical reason or another, almost all science got affected by obsession with "statistical significance". Typical research goes as follows:
  • Gather data
  • Throw data at some statistical package
  • Keep tweaking "hypothesis" until you get "statistically significant" answer, and send for publication
  • If no "statistically significant" answers are obtained, usually forget about the entire thing, or maybe send for publications anyway and hope it gets accepted
  • Watch people misinterpret "statistically significant" as "almost certainly true" and "not statistically significant" as "almost certainly false"
But of course, "statistically significant" results are false very often. Assuming perfectly run research, which is never true, 95% statistical significance only guarantees at most 1 false positive per 20 tested hypotheses (well, independent hypotheses, but let's skip complications here). So if you tweaked your hypothesis 50 times in your statistical package, your "significant" results are very likely to be wrong. Then according to established scientific practice you can publish a paper claiming that "broccoli reduces risk of cancer in Asian women aged 40-49 (p<0.05)", "global warming causes malaria (p<how dare you question Al Gore)", or somesuch.



Still, that's nothing compared with the uselessness of something being "not statistically significant". Even otherwise smart people routinely misinterpret it as "with high probability not true". If so, let me start a study in which I will take the wallets of a randomly selected half of 10 such people, and we'll measure if there's any statistically significant reduction of wealth from me taking their wallets. Double blinded and all! And as science will prove no statistically significant reduction relative to the control group, which court would dare to question science and convict me for doing so?
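Why is the wallet study guaranteed to come back "not significant"? Because the noise from random group assignment dwarfs the effect. A quick sketch - the wealth figures below are entirely made up for illustration:

```python
from itertools import combinations
from statistics import mean

# Hypothetical wealth of 10 people (dollars) - invented numbers with the
# kind of person-to-person spread you'd see in practice.
wealth = [12_000, 85_000, 230_000, 4_000, 560_000,
          31_000, 9_500, 140_000, 47_000, 18_000]
WALLET = 60  # dollars taken from each person in the "treatment" half

people = range(10)

# Assignment noise: the difference between group means caused purely by
# which 5 people end up in which group, before any wallets are taken.
diffs = [abs(mean(wealth[i] for i in combo)
             - mean(wealth[i] for i in people if i not in combo))
         for combo in combinations(people, 5)]

median_noise = sorted(diffs)[len(diffs) // 2]
print(f"median group difference from assignment alone: ${median_noise:,.0f}")
print(f"effect the study is supposed to detect: ${WALLET}")
```

The effect is buried hundreds of times deeper than the assignment noise, so "no statistically significant reduction" is decided before the study even starts.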

While false positives - wrongly rejecting a null hypothesis - can come from either bad luck or a bad study, false negatives - not rejecting a wrong null hypothesis - can come from either of those, or from insufficient sample size relative to effect strength. How strong would the effect need to be?

Let's look at a smoking trial:
A randomised controlled trial of anti-smoking advice in 1445 male smokers, aged 40-59, at high risk of cardiorespiratory disease. After one year reported cigarette consumption in the intervention group (714 men) was one-quarter that of the “normal care” group (731 men); over 10 years the net reported reduction averaged 53%. The intervention group experienced less nasal obstruction, cough, dyspnoea, and loss of ventilatory function.

During the next 20 years there were 620 deaths (231 from coronary heart disease), 96 cases of lung cancer, and 159 other cancers. Comparing the intervention with the normal care group, total mortality was 7% lower, fatal coronary heart disease was 13% lower, and lung cancer (deaths+registrations) was 11% lower.

Which is not statistically significant. And some libertarian will surely use it to "prove" that smoking is perfectly fine, and it's all government conspiracy to mess in people's lives.

But what were the chances of getting statistically significant results? For clarity let's skip all statistical complications and do it the most brutal possible way, and even make both groups the same size. Control group size 720, intervention group size 720, true chance of death in the control group 45%, true chance in the intervention group 42% (7% reduction in mortality) - we just don't know it yet. Statistical significance level 95%. So the control group had a 95% chance of getting between 298 and 350 deaths (I'll skip the issue of one-sided and two-sided tests of significance, as the entire idea is highly offensive to Bayesians). The chance of the intervention group having fewer than 298 deaths - merely 38%. So even assuming, entirely implausibly, that this 1440-person-strong 20-year study was perfectly run, there'd be a 62% chance that the results would be worthless, because 1440 people is very far from enough. Except maybe as fodder for a meta-analysis.
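Under the same brutal simplifications, the whole power calculation fits in a few lines. This sketch uses a normal approximation to the binomial and the rounded 720/45%/42% figures from above:

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

N = 720            # per group, roughly the trial's size
P_CONTROL = 0.45   # assumed true death rate, "normal care" group
P_TREAT = 0.42     # assumed true death rate, intervention (7% lower)

# Normal approximation to each group's binomial death count.
mu_c, sd_c = N * P_CONTROL, sqrt(N * P_CONTROL * (1 - P_CONTROL))
mu_t, sd_t = N * P_TREAT, sqrt(N * P_TREAT * (1 - P_TREAT))

# Brutally simple one-sided test: declare significance only if the
# intervention death count falls below the control group's lower 95% bound.
threshold = mu_c - 1.96 * sd_c
power = norm_cdf((threshold - mu_t) / sd_t)
print(f"significance threshold: {threshold:.0f} deaths")
print(f"chance the study detects the real effect: {power:.0%}")
```

It lands within a point or two of the ~38% figure - even a perfectly run 1440-person, 20-year trial usually misses a real 7% mortality reduction.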

By the way, the reduction is merely 7% because what was studied was "trying to convince people to stop smoking". Most people wouldn't be convinced or would relapse; and many in the control group would stop smoking on their own.

Anyway, how many people would we need for a study with 45% and 42% death-rate groups to have less than a 5% chance of both a false positive (conditional on the null hypothesis being true) and a false negative (conditional on the null hypothesis being false)? 3350 in each group, or 6700 altogether. And that was smoking - the biggest common health risk we know. How many people would we need if we studied normal-risk people, let's say 10% death rates during the study time, and levels of relative risk typical for a diet/lifestyle intervention, let's say 1%? Two times 1.2M, or 2.4M people. More than the entire population of Latvia. And that's before you consider how highly non-random dropouts from such studies are, and how they will swamp any results we would get. And any double-blindness here?
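A textbook normal-approximation formula for comparing two proportions lands in the same ballpark. Exact numbers depend on the chosen error rates (the defaults below assume a one-sided 5% significance level and 80% power, which roughly matches the figures above; insisting on 95% power pushes them higher still):

```python
from math import ceil

def n_per_group(p1, p2, z_alpha=1.645, z_beta=0.84):
    """Approximate per-group sample size for comparing two proportions,
    via the normal approximation (one-sided alpha=5%, power 80%)."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Smoking-level risk: 45% vs 42% death rates -> a few thousand per group
n_smoking = n_per_group(0.45, 0.42)
# Typical diet/lifestyle effect: 10% vs 9.9% -> around a million per group
n_lifestyle = n_per_group(0.10, 0.099)
print(n_smoking, n_lifestyle)
```

Thousands of people for the biggest health risk we know, over a million per group for an ordinary lifestyle effect - which is why such studies essentially never get run.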

In all likelihood, we will never get any data about the effects of this kind of intervention. Science will have no idea if fish oil cures cancer, if vegetables lower the risk of obesity, or if organic food is better for you than non-organic.



Hammy the Hamster tries organic food
The only thing we will have will be lower-quality indirect studies with highly ambiguous interpretations, which we might actually have to accept. That, and misunderstandings of statistical significance.

Monday, January 28, 2008

PageRank - a new addition to your Data Processing Toolkit

G0321 by Piez from flickr (CC-BY)

Back when I was at Universität des Saarlandes we had a great seminar at MPII. It was called "Data Processing Tips and Tricks" and covered some important data processing techniques. These techniques varied a lot. What they had in common was their universality - you could throw pretty much any data at them and extract something. And by preprocessing your data differently, and postprocessing the results differently, you could get loads of interesting things without inventing any new algorithms.

Here's a partial list of the classic universal data processing methods, in no particular order:

  • Bayesian statistics
  • EM algorithm
  • Wavelets
  • Levenberg-Marquardt optimization
  • Lloyd clustering
  • Karhunen-Loeve Transform (aka Principal Component Analysis)
  • Singular Value Decomposition
  • multi-dimensional scaling
  • Algebraic reconstruction technique
  • Support Vector Machines
  • Graph Cut Optimization
  • Level-Set Methods
  • RANSAC
  • Neural Networks
  • Hidden Markov Models
  • Regular expressions

By simply being aware of their existence you greatly increase your chances of solving the really big problems you'll be facing. For example, naive Bayesian statistics was famously used for fighting spam, and in spite of its striking simplicity it is much more effective than all the custom and complicated methods that preceded it.
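The spam-filtering idea is easy to sketch. Everything below - the toy corpus, the word counts - is made up for illustration; real filters train on thousands of messages:

```python
from collections import Counter
from math import log

# Tiny invented training corpus.
spam = ["buy cheap pills now", "cheap pills cheap deals", "win money now"]
ham = ["meeting moved to tuesday", "lunch tomorrow", "notes from the meeting"]

def train(docs):
    """Count word occurrences across a list of documents."""
    counts = Counter(word for d in docs for word in d.split())
    return counts, sum(counts.values())

spam_counts, spam_total = train(spam)
ham_counts, ham_total = train(ham)
vocab = set(spam_counts) | set(ham_counts)

def spam_score(message):
    """Log-odds of spam vs ham with add-one (Laplace) smoothing;
    positive means the message looks like spam."""
    score = log(len(spam) / len(ham))  # prior odds
    for word in message.split():
        p_spam = (spam_counts[word] + 1) / (spam_total + len(vocab))
        p_ham = (ham_counts[word] + 1) / (ham_total + len(vocab))
        score += log(p_spam / p_ham)
    return score

print(spam_score("cheap pills"))    # positive -> spammy
print(spam_score("meeting notes"))  # negative -> hammy
```

The "naive" part is treating words as independent given the class - obviously false, yet good enough to beat far more complicated hand-tuned rules.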

In the last few years the data processing toolkit gained a new tool - PageRank. Pretty much everybody knows how it works for scoring websites, but the algorithm is capable of much more than that. One great example is extracting keywords from documents. It has nothing to do with the original problem of website scoring, but if you treat words as nodes (websites), create links between words that occur close to each other, and run PageRank on such a graph, you get very decent keywords. Of course you might want to add some pre- and post-processing to improve keyword quality (obviously removing HTML tags, also stemming, removing stopwords, weighting words by part of speech, or whatever you feel like doing), but so does Google in determining pages' scores. And I bet you expected keyword extractors to either actually understand what's written (not possible yet) or to simply count the number of occurrences (really horrible results).
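Here's a bare-bones sketch of that idea: plain power-iteration PageRank run on a word co-occurrence graph, with none of the stemming or stopword removal a real extractor would add. The sample text is invented:

```python
from collections import defaultdict

def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over an adjacency dict of sets."""
    nodes = list(links)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iterations):
        rank = {n: (1 - damping) / len(nodes)
                   + damping * sum(rank[m] / len(links[m]) for m in links[n])
                for n in nodes}
    return rank

def keywords(text, window=2, top=3):
    """Treat words as nodes, link words occurring within `window` positions
    of each other, and rank the nodes with PageRank."""
    words = text.lower().split()
    links = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if words[j] != w:  # undirected links, no self-loops
                links[w].add(words[j])
                links[words[j]].add(w)
    rank = pagerank(links)
    return sorted(rank, key=rank.get, reverse=True)[:top]

text = ("pagerank scores graph nodes pagerank extracts keywords "
        "keywords describe documents graph nodes link words")
print(keywords(text))
```

Words that co-occur with many other well-connected words bubble to the top - exactly the "important pages link to important pages" intuition, transplanted to text.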

You can use PageRank to assess the importance of countries in international trade, the importance of people in an organization's communication flow, and many other problems. Or you could simply throw arbitrary graphs at PageRank, look at the results, and guess what they mean. Perhaps that will be enough to solve the problem you've been thinking about for such a long time. If not, you still have two dozen other universally applicable techniques to choose from.