Beware the Weasel Word "Statistical" in Statistical Significance!

As Justin Wolfers pointed out in his post on income inequality last week, the Census Bureau was talking statistical nonsense. I blame the whole idea of statistical significance. For its weasel adjective “statistical” concedes that the significance might not be the kind about which you care. Here, I’ll explain what statistical significance is, and how its use is harmful to society.

To evaluate the statistical significance of an effect, you calculate the so-called p value; if the p value is small enough, the effect is declared statistically significant. For an example to illustrate the calculations, imagine that your two children Alice and Bob play 30 rounds of the card game “War,” and that the results are 20-10 in favor of Bob. Was he cheating?

To calculate the p value, you need an assumption, called the null (or no-effect) hypothesis: here, that the game results are due to chance (i.e., no cheating). The p value is the probability, assuming the null hypothesis, of getting results at least as extreme as the actual results of 20-10. Here, the probability of Bob’s winning at least 20 games is 0.049. (Try it out at Daniel Soper’s “Cumulative Binomial Probability Calculator.”)

If this p value is low enough—typically, below 0.05—you reject the null hypothesis. (For the dangers of a magic value like 0.05, see the post “Is ‘Statistically Significant’ Really Significant?”) Here, because 0.049 is (barely!) less than 0.05, we reject the null hypothesis of fair play, and conclude that Bob probably cheated.
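For readers who would rather check the 0.049 themselves than trust a web calculator, here is a minimal sketch in Python. It assumes that fair play means each round is an independent coin flip with a 50 percent chance for Bob; the function name `binomial_upper_tail` is made up for this illustration.

```python
from math import comb


def binomial_upper_tail(n: int, k: int, p: float = 0.5) -> float:
    """Probability of at least k wins in n independent rounds,
    each won with probability p (a hypothetical helper)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Null hypothesis (assumed): fair play, so Bob wins each round with p = 0.5.
p_value = binomial_upper_tail(n=30, k=20)

print(f"p value = {p_value:.3f}")  # prints "p value = 0.049"
print("reject the null" if p_value < 0.05 else "do not reject the null")
```

Note that the sum runs over 20, 21, ..., 30 wins, even though only the 20-win outcome actually happened.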

Although the procedure is mathematically well defined, it is nonsense:

  1. To calculate the probability of Bob’s winning at least 20 games, you use not only the data that you saw (that Bob won 20 games out of 30) but also the imaginary data of Bob winning 21 out of 30 games, or 22 out of 30 games, all the way up to Bob winning all 30 games. The idea of imaginary data is oxymoronic.
  2. You compute the probability of the data (the game results), rather than the probability that you want: the probability of the hypothesis (that Bob cheated).
  3. By using only the null hypothesis, you ignore information about the alternative hypotheses. For example, if Bob is 3 years old, you need far more decisive data to conclude that he cheated than you need if he is 15 years old and just back from reform school.

In short, a result can be statistically significant, yet meaningless. In my rare cynical moments, I see this problem as explaining why popular health advice flips every 5 or 10 years: for example, from “eat saturated fat” to “eat monounsaturated fats” to “eat no fat” to “Atkins diet.” The original results were statistically significant, but probably meaningless. After 10 years, researchers collect enough grant money to conduct another large-enough study. If it confirms the old advice, no one hears about it (“dog bites man”). If it overturns the old advice (“man bites dog”), publish and report away!

Fortunately, there is a sound alternative: Use Bayes’ theorem. It solves the three problems with traditional significance testing:

  1. The calculation using Bayes’ theorem does not depend on imaginary data. It considers only the actual data of Bob’s winning 20 games out of 30.
  2. Bayes’ theorem directly gives you the probability of cheating given the data (instead of the probability of the data, given fair play). Here, Bayes’ theorem tells you that, except for contorted alternative hypotheses, the data increases the odds on cheating by at most a factor of 5.5. So, even if you give your initial probability that Bob cheated as 50-50, the new probability, after incorporating the 20-10 game results, is at most 85 percent. This probability is much less than the 95 percent often, and incorrectly, assumed to result after “rejecting the null hypothesis at the 0.05 level” (as happened here).
  3. Bayes’ theorem considers the alternative hypotheses. Although specifying the alternative hypotheses (and your beliefs in them) requires more thought than evaluating statistical significance, drawing sensible conclusions requires this thinking anyway. Bayes’ theorem merely forces you to think correctly.
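The factor of 5.5 and the 85 percent can be reproduced with a short sketch. The post does not spell out the calculation, so this is one reading that happens to match the numbers: it assumes the alternative most favored by the data is that a cheating Bob wins each round with probability 20/30, and it compares the likelihood of the actual 20-10 result under that alternative with its likelihood under fair play. The function and variable names are made up for this illustration.

```python
def likelihood(p_win: float, wins: int = 20, losses: int = 10) -> float:
    """Likelihood of the observed 20-10 result if Bob wins each round with
    probability p_win. The binomial coefficient is omitted because it
    cancels when two likelihoods are divided."""
    return p_win**wins * (1 - p_win) ** losses


# Assumed best-case alternative: the win probability that makes the
# observed data most likely, 20/30. Any other single value gives a smaller ratio.
max_likelihood_ratio = likelihood(20 / 30) / likelihood(0.5)

prior_odds = 1.0  # a 50-50 initial belief that Bob cheated
posterior_odds = prior_odds * max_likelihood_ratio
posterior_prob = posterior_odds / (1 + posterior_odds)

print(f"odds increase by at most a factor of {max_likelihood_ratio:.1f}")  # 5.5
print(f"probability of cheating is at most {posterior_prob:.0%}")  # 85%
```

Only the actual result of 20 wins in 30 rounds enters the calculation. A more skeptical starting point, such as `prior_odds = 0.1` for a 3-year-old Bob, lowers the final probability accordingly.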

Using p values, by contrast, lets you avoid thinking and produce nonsense. Because p values and statistical significance underlie almost all medical research, society is basing life-and-death decisions upon mathematical quicksand.

You can learn more from many sources:

  1. The article “ESP and the significance of significance” from
  2. Steven Goodman’s paper “A Dirty Dozen: Twelve P-Value Misconceptions,” from the “Further reading” of Wikipedia’s quite good article on p values. The analysis of the misconceptions is very helpful, although the second equation on page 139 (authors, please number your equations!), a.k.a. page 5 of the PDF file, is a misprint. The factor “posterior odds(H_0, given the data)” should read “prior odds(H_0)”.
  3. Edwin Jaynes’s paper “Confidence Intervals vs Bayesian Intervals” (for the mathematically prepared!) and his magnum opus, Probability Theory: The Logic of Science (CUP, 2003).