Beware the Weasel Word "Statistical" in Statistical Significance!

As Justin Wolfers pointed out in his post on income inequality last week, the Census Bureau was talking statistical nonsense. I blame the whole idea of statistical significance. For its weasel adjective “statistical” concedes that the significance might not be the kind about which you care. Here, I’ll explain what statistical significance is, and how its use is harmful to society.

To evaluate the statistical significance of an effect, you calculate the so-called p value; if the p value is small enough, the effect is declared statistically significant. For an example to illustrate the calculations, imagine that your two children Alice and Bob play 30 rounds of the card game “War,” and that the results are 20-10 in favor of Bob. Was he cheating?

To calculate the p value, you need an assumption, called the null (or no-effect) hypothesis: here, that the game results are due to chance (i.e. no cheating). The p value is the probability, computed assuming the null hypothesis, of getting results at least as extreme as the actual results of 20-10. Here, the probability of Bob’s winning at least 20 games is 0.049. (Try it out at Daniel Sloper’s “Cumulative Binomial Probability Calculator.”)
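
For readers who like to check the arithmetic, here is a rough sketch of the same tail-probability calculation in Python, assuming the fair-play model described above (each game a 50-50 proposition):

```python
from math import comb

n, wins = 30, 20          # 30 games, of which Bob won 20
p_fair = 0.5              # null hypothesis: fair play, each game is a 50-50 proposition

# p value: probability, under fair play, of a result at least as extreme as 20 wins,
# i.e. Bob winning 20, 21, ..., or all 30 games
p_value = sum(comb(n, k) * p_fair**k * (1 - p_fair)**(n - k)
              for k in range(wins, n + 1))

print(round(p_value, 3))  # 0.049
```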

If this p value is low enough—typically, below 0.05—you reject the null hypothesis. (For the dangers of a magic value like 0.05, see the post “Is ‘Statistically Significant’ Really Significant?”) Here, because 0.049 is (barely!) less than 0.05, we reject the null hypothesis of fair play, and conclude that Bob probably cheated.

Although the procedure is mathematically well defined, it is nonsense:

  1. To calculate the probability of Bob’s winning at least 20 games, you use not only the data that you saw (that Bob won 20 games out of 30) but also the imaginary data of Bob winning 21 out of 30 games, or 22 out of 30 games, all the way up to Bob winning all 30 games. The idea of imaginary data is oxymoronic.

  2. You compute the probability of the data (the game results), rather than the probability that you want: the probability of the hypothesis (that Bob cheated).

  3. By using only the null hypothesis, you ignore information about the alternative hypotheses. For example, if Bob is 3 years old, you need far more decisive data to conclude that he cheated than you need if he is 15 years old and just back from reform school.

In short, a result can be statistically significant, yet meaningless. In my rare cynical moments, I see this problem as explaining why popular health advice flips every 5 or 10 years: for example, from “eat saturated fat” to “eat monounsaturated fats” to “eat no fat” to “Atkins diet.” The original results were statistically significant, but probably meaningless. After 10 years, researchers collect enough grant money to conduct another large-enough study. If it confirms the old advice, no one hears about it (“dog bites man”). If it overturns the old advice (“man bites dog”), publish and report away!

Fortunately, there is a sound alternative: Use Bayes theorem. It solves the three problems with traditional significance testing:

  1. The calculation using Bayes theorem does not depend on imaginary data. It considers only the actual data of Bob’s winning 20 games out of 30.

  2. Bayes theorem directly gives you the probability of cheating given the data (instead of the probability of the data, given fair play). Here, Bayes theorem tells you that, except for contorted alternative hypotheses, the data increase the odds on cheating by at most a factor of 5.5. So, even if you give your initial probability that Bob cheated as 50-50, the new probability, after incorporating the 20-10 game results, is at most 85 percent (the short calculation below shows where these numbers come from). This probability is much less than the 95 percent often, and incorrectly, assumed to result after “rejecting the null hypothesis at the 0.05 level” (as happened here).

  3. Bayes theorem considers the alternative hypotheses. Although specifying the alternative hypotheses (and your beliefs in them) requires more thought than evaluating statistical significance, drawing sensible conclusions requires this thinking anyway. Bayes theorem merely forces you to think correctly.

Using p values, in contrast, lets you avoid thinking and produce nonsense. Because p values and statistical significance underlie almost all medical research, society is basing life-and-death decisions upon mathematical quicksand.
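
For readers who want to see where the factor of 5.5 and the 85 percent come from, here is a rough sketch in Python. It uses, for illustration, the single most favorable point alternative: that a cheating Bob wins each game with probability 2/3, the value that best fits the 20-10 result. Any other alternative gives a smaller factor, which is why 5.5 is an upper bound.

```python
from math import comb

n, wins = 30, 20

def prob_of_data(p):
    """Probability that Bob wins exactly 20 of 30 games if his per-game win probability is p."""
    return comb(n, wins) * p**wins * (1 - p)**(n - wins)

# Likelihood of the data under fair play and under the most favorable cheating hypothesis
L_fair = prob_of_data(0.5)
L_cheat = prob_of_data(20 / 30)   # best-fitting point alternative, p = 2/3

bayes_factor = L_cheat / L_fair
print(round(bayes_factor, 1))     # about 5.5

# Starting from 50-50 (prior odds of 1), the posterior odds are 1 * 5.5,
# so the posterior probability of cheating is 5.5 / (1 + 5.5)
posterior = bayes_factor / (1 + bayes_factor)
print(round(posterior, 2))        # about 0.85, i.e. 85 percent, not 95 percent
```

The same arithmetic also shows why problem 3 matters: with a skeptical prior, say 1-in-100 odds of cheating for a 3-year-old Bob, the factor of 5.5 raises the probability of cheating only to about 5 percent.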

You can learn more from many sources:

  1. The article “ESP and the significance of significance” from understandinguncertainty.org.

  2. Steven Goodman’s paper “A Dirty Dozen: Twelve P-Value Misconceptions,” from the “Further reading” of Wikipedia’s quite good article on p values. The analysis of the misconceptions is very helpful, although the second equation on page 139 (authors, please number your equations!), a.k.a. page 5 of the PDF file, is a misprint. The factor “posterior odds(H_0, given the data)” should read “prior odds(H_0)”.

  3. Edwin Jaynes’s paper “Confidence Intervals vs. Bayesian Intervals” (for the mathematically prepared!) and his magnum opus Probability Theory: The Logic of Science (CUP, 2003).


Chris Auld

There are lots of decision theoretic critiques of frequentist inference. I don't think, however, that many of the points Sanjoy makes are drawn from those critiques, and he makes what appear to be some conceptual errors in his description of the frequentist approach.

Both Bayesian and classical (maximum-likelihood-based) analyses of binomial events involve specifying probability models for the data. Under those models, the probability of any outcome can be calculated. Referring to those calculations as "using imaginary data" is just wrong.

The alternative hypothesis must be specified in the frequentist approach. It is most commonly given as a direction rather than as a point hypothesis. In this case, we would usually test the null that the probability that, say, Alice wins is 0.5 versus the alternative that that probability is not 0.5. It is also possible to test against point alternatives in the frequentist paradigm, analogous to what Sanjoy does (as he details in the comments). Since we're then asking different questions of the data, we generally get different answers, so the fact that the p-value in the usual test is not equal to Sanjoy's odds ratio (and wouldn't be equal to the p-value in the frequentist test against the point null that Alice wins with probability 2/3) is not remarkable and is not a valid criticism of the usual frequentist test.
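
To make the contrast concrete, here is a rough sketch under the usual binomial model; it compares the two-sided test of the null that Bob wins each game with probability 0.5 with a likelihood ratio against the point alternative of 2/3:

```python
from math import comb

n, wins = 30, 20

def pmf(k, p):
    """Probability of exactly k Bob wins in 30 games if his per-game win probability is p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Usual two-sided test: null p = 0.5 against the alternative p != 0.5,
# so "at least as extreme" includes lopsided results in either direction.
two_sided_p = (sum(pmf(k, 0.5) for k in range(wins, n + 1)) +
               sum(pmf(k, 0.5) for k in range(0, n - wins + 1)))
print(round(two_sided_p, 3))   # about 0.099

# Likelihood ratio against the point alternative p = 2/3: a different question,
# and unsurprisingly a different number.
print(round(pmf(wins, 2 / 3) / pmf(wins, 0.5), 1))   # about 5.5
```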

Interpreting data which cast doubt on the hypothesis that Alice wins with probability 0.5 as evidence of cheating *has nothing to do with the statistical model per se*. We're testing the hypothesis (or doing the Bayesian analog) that a parameter is 0.5, not that "Alice cheated." Those two statements are only equivalent if we impose enough additional assumptions such that the *only* reason Alice does not win with probability 0.5 is that Alice cheated. Either a Bayesian or classical approach to the problem could be embellished to allow for other observable variables which affect the probability of winning, such as age.

Finally, it is possible in either a Bayesian or classical framework for an analyst to confuse statements like "the data tell us X and Y move together" with statements like "the data tell us the magnitude of the relationship between X and Y is scientifically important." That is, confusing statistical and scientific significance is an error of interpretation, not something intrinsically problematic in either statistical approach.


Sanjoy Mahajan

We are discussing different procedures when we say "classical." I am talking about null-hypothesis significance testing (often NHST in the literature). You are talking about maximum-likelihood (ML) inference. ML is Bayes theorem without the prior-probability factor (i.e. assuming that all hypotheses are a priori equally plausible).

To see why ML is not sufficient, consider the following inference problem (due to I.J. Good, I think). You put your finger down at a random spot in a phone book, assuming that anyone still uses one or has one around, and it points to the name "Jameson." The ML hypothesis is that the entire phone book is just "Jameson" (perhaps with varying first names). Our intuition is not happy with this conclusion, because even with this datum the posterior probability of this hypothesis is extremely low (even though the likelihood P(D|H) is 1). Bayes theorem captures this intuition with the tiny prior probability of the hypothesis.
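
A toy version of that contrast, with made-up prior numbers chosen only to illustrate the shape of the calculation:

```python
# Two toy hypotheses about the phone book, with illustrative prior plausibilities:
#   "ordinary":     a normal phone book, where roughly 1 entry in 10,000 is "Jameson"
#   "all_jameson":  the phone book consisting entirely of Jamesons
prior = {"ordinary": 0.999999, "all_jameson": 0.000001}
likelihood = {"ordinary": 1 / 10000, "all_jameson": 1.0}   # P(finger lands on "Jameson" | H)

# Maximum likelihood picks the hypothesis under which the datum is most probable.
print(max(likelihood, key=likelihood.get))          # all_jameson

# Bayes theorem: posterior proportional to prior times likelihood.
unnormalized = {h: prior[h] * likelihood[h] for h in prior}
total = sum(unnormalized.values())
posterior = {h: unnormalized[h] / total for h in unnormalized}
print(round(posterior["all_jameson"], 3))           # about 0.01: still very improbable
```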


Chris Auld

Sanjoy, I specified "maximum likelihood based" to highlight that in both that classical approach and in Bayesian modeling, the probability of any outcome from the experiment can be calculated conditional on the parameters, which isn't necessarily true in other frequentist approaches requiring weaker assumptions. Perhaps I do not understand your "imaginary data" complaint: can you provide a cite?

No one, well, no one who knows what they're doing, actually does significance testing mechanically. I don't know anyone who would treat a p-value of 4.9% as being substantively (or statistically, for that matter) different than a p-value of 5.1%, for example.

I misread your example with players of different ages, but the rest of my points stand. Here's another: the modifier "statistical" in front of "significance" is the exact opposite of a weasel word. Statistical and substantive significance, as you point out, are different concepts. It is good writing never to use the word "significance" without clearly specifying whether you mean statistically or substantively. It's not weaselly, it's necessary, to let the reader know which of these concepts is intended when using the word "significance." The Census Bureau's gaffe nicely illustrates this point!


Bjorn Roche

There is a much bigger problem with your original evaluation of whether or not Bob is cheating: you only ask the question (is Bob cheating?) when something unusual happens (e.g., Bob wins a bunch of games in a row). Therefore the sample is inherently biased.

Sanjoy Mahajan

That's an interesting point. I agree that you shouldn't ignore relevant data. But what the relevant data are depends on what hypothesis you are testing. If Bob just came back from reform school, these 30 games are the first set played since his return, and you are wondering whether he cheated, then those are the right data. If the hypothesis is that Bob has been cheating for the past week, then you'd use data from the past week's games (if you have it).

Jerome Solanum

Actually, what we should beware of is people who only think they understand the notion of statistical significance but in reality create straw men to disparage it.

Jim Hess

I am so tired of hearing about the 47% of Americans who pay no taxes. How many millionaires pay no taxes, and what percentage of all millionaires is that?

sam

I stumbled upon this website. Everyone's response seems brilliant, but can someone explain how statistical significance harms society, using an example besides Bob?