In 1925, a statistician named R. A. Fisher published a book called Statistical Methods for Research Workers. In it, he proposed a line in the sand: 0.05. If chance alone would produce a result as extreme as yours more than one time in 20 (that is, if its p-value exceeds 0.05), dismiss it as a fluke. In other words, let's filter out 19 of every 20 flukes.
Why let through the other one in 20? Well, you can set the threshold lower than 5% if you like. Fisher himself was happy to consider 2% or 1%. But this drive to avoid false positives incurs a new risk: false negatives. The more flukes you weed out, the more true results get caught in the filter as well.
Suppose you’re studying whether men are taller than women. Hint: they are. But what if your sample is a little fluky? What if you happen to pick taller-than-typical women and shorter-than-typical men, yielding an average difference of just 1 or 2 inches? Then a strict p-value threshold may reject the result as a fluke, even though it’s quite genuine.
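The height scenario is easy to simulate. The sketch below is a toy experiment, not real anthropometric data: the means and standard deviations are invented round numbers, and a normal approximation stands in for a proper t-test. It draws tiny samples of men and women whose true average heights genuinely differ, then counts how often the p < 0.05 filter throws the real difference away.

```python
import math
import random

random.seed(42)

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def two_sample_p(a, b):
    """Two-sided p-value from a Welch-style z statistic.

    A rough stand-in for a t-test; fine for a sketch, too
    optimistic for samples this small in serious work.
    """
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    se = math.sqrt(va / len(a) + vb / len(b))
    z = (ma - mb) / se
    return 2 * (1 - normal_cdf(abs(z)))

n = 3            # tiny samples invite flukes
trials = 10_000
misses = 0
for _ in range(trials):
    # Assumed (made-up) heights in inches: a true 5.5-inch gap.
    men = [random.gauss(69.0, 3.0) for _ in range(n)]
    women = [random.gauss(63.5, 3.0) for _ in range(n)]
    if two_sample_p(men, women) >= 0.05:
        misses += 1  # a genuine difference dismissed as a fluke

print(f"Genuine differences rejected at p < 0.05: {misses / trials:.0%}")
```

Even though men really are taller in this toy world, samples of three per group miss significance a sizable fraction of the time: those are the true results caught in the filter.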
The number 0.05 represents a compromise, a middle ground between incarcerating the innocent and letting the guilty walk free.
For his part, Fisher never meant 0.05 as an ironclad rule. In his own career, he showed an impressive flexibility. Once, in a single paper, he smiled on a p-value of 0.089 (“some reason to suspect that the distribution… is not wholly fortuitous”) yet waved off one of 0.093 (“such association, if it exists, is not strong enough to show up significantly”).
To me, this makes sense. A foolish consistency is the hobgoblin of little statisticians. If you tell me that after-dinner mints cure bad breath (p = 0.04), I’m inclined to believe you. If you tell me that after-dinner mints cure osteoporosis (p = 0.04), I’m less persuaded. I admit that 4% is a low probability. But I judge it even less likely that science has, for decades, overlooked a powerful connection between skeletal health and Tic Tacs.
All new evidence must be weighed against existing knowledge. Not all 0.04s are created equal.