Saturday, April 16, 2016

Reliability of Scientific Facts

Fisher certainly understood that clearing the significance bar wasn’t the same thing as finding the truth. He envisioned a richer, more iterative approach, writing in 1926: “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”

Not “succeeds once in giving,” but “rarely fails to give.” A statistically significant finding gives you a clue, suggesting a promising place to focus your research energy. The significance test is the detective, not the judge. You know how when you read an article about a breakthrough finding that this thing causes that thing, or that thing prevents the other thing, and at the end there’s always a banal sort of quote from a senior scientist not involved in the study intoning some very minor variant of “The finding is quite interesting, and suggests that more research in this direction is needed”? And how you don’t really even read that part because you think of it as an obligatory warning without content?

Here’s the thing—the reason scientists always say that is because it’s important and it’s true! The provocative and oh-so-statistically-significant finding isn’t the conclusion of the scientific process, but the bare beginning. If a result is novel and important, other scientists in other laboratories ought to test and retest the phenomenon and its variants, trying to figure out whether the result was a one-time fluke or whether it truly meets the Fisherian standard of “rarely fails.” That’s what scientists call replication; if an effect can’t be replicated, despite repeated trials, science backs apologetically away. The replication process is supposed to be science’s immune system, swarming over newly introduced objects and killing the ones that don’t belong.
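
Fisher’s distinction is easy to see in numbers. Here’s a minimal simulation sketch, in plain Python, using a two-sided z-test with known variance and invented parameters (none of which come from Fisher): an experiment on a nonexistent effect “succeeds once in giving” significance every so often, while an adequately powered experiment on a real effect “rarely fails to give” it.

    import math
    import random

    random.seed(0)  # fixed seed so the sketch is reproducible

    ALPHA = 0.05      # the significance bar
    N = 100           # observations per experiment
    TRIALS = 10_000   # independent repetitions of the experiment

    def p_value(sample_mean, n):
        # Two-sided z-test of "no effect" (mean zero), unit variance assumed.
        z = sample_mean * math.sqrt(n)
        return math.erfc(abs(z) / math.sqrt(2))

    def significance_rate(true_mean):
        # Fraction of repetitions that clear p < ALPHA.
        hits = 0
        for _ in range(TRIALS):
            mean = sum(random.gauss(true_mean, 1) for _ in range(N)) / N
            if p_value(mean, N) < ALPHA:
                hits += 1
        return hits / TRIALS

    print(f"no effect   (mean 0.0): {significance_rate(0.0):.1%} of runs significant")
    print(f"real effect (mean 0.3): {significance_rate(0.3):.1%} of runs significant")

With these made-up numbers, the nonexistent effect clears the bar about 5 percent of the time, exactly the false-positive rate the 0.05 threshold promises, while the real effect clears it in roughly 85 percent of replications. One significant result can’t tell you which world you’re in; a run of replications can.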

That’s the ideal, at any rate. In practice, science is a bit immunosuppressed. Some experiments, of course, are hard to repeat. If your study measures a four-year-old’s ability to delay gratification and then relates those measurements to life outcomes thirty years later, you can’t just pop out a replication.
But even studies that could be replicated often aren’t. Every journal wants to publish a breakthrough finding, but who wants to publish the paper that does the same experiment a year later and gets the same result? Even worse, what happens to the studies that carry out the same experiment and don’t find a significant result? For the system to work, those experiments need to be made public. Too often, they end up in the file drawer instead.
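
To see why the file drawer matters, here’s a companion sketch under the same toy assumptions as above: a thousand hypothetical labs, all studying an effect that doesn’t exist, with only the significant results making it into print.

    import math
    import random

    random.seed(1)  # fixed seed so the sketch is reproducible

    ALPHA, N, LABS = 0.05, 100, 1000  # invented numbers for illustration

    def run_lab():
        # One lab studies an effect that does not exist (true mean is zero).
        mean = sum(random.gauss(0, 1) for _ in range(N)) / N
        z = mean * math.sqrt(N)
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided z-test, as before
        return mean, p

    results = [run_lab() for _ in range(LABS)]
    # The file drawer: only the significant results reach the journals.
    published = [mean for mean, p in results if p < ALPHA]

    print(f"{len(published)} of {LABS} labs 'find' the nonexistent effect")
    print(f"average published effect size: {sum(map(abs, published)) / len(published):.2f}")

Roughly fifty of the thousand labs get lucky, and because the other nine hundred fifty results sit in the drawer, the published record looks like dozens of independent confirmations of a pure fluke, complete with respectable-looking effect sizes that are significant by construction.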