# Beware the Weasel Word “Statistical” in Statistical Significance!

As Justin Wolfers pointed out in his post on income inequality last week, the Census Bureau was talking statistical nonsense. I blame the whole idea of statistical significance. For its weasel adjective “statistical” concedes that the significance might not be the kind about which you care. Here, I’ll explain what statistical significance is, and how its use is harmful to society.

To evaluate the statistical significance of an effect, you calculate the so-called p value; if the p value is small enough, the effect is declared statistically significant. For an example to illustrate the calculations, imagine that your two children Alice and Bob play 30 rounds of the card game “War,” and that the results are 20-10 in favor of Bob. Was he cheating?

To calculate the p value, you need an assumption, called the null (or no-effect) hypothesis: here, that the game results are due to chance (i.e. no cheating). The p value is the probability of getting results at least as extreme as the actual results of 20-10. Here, the probability of Bob’s winning at least 20 games is 0.049. (Try it out at Daniel Sloper’s “Cumulative Binomial Probability Calculator.”)
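If you would rather check the 0.049 figure than take it on trust, here is a minimal Python sketch. It assumes the null hypothesis means every round is an independent, fair coin flip; the function name `p_value_at_least` is made up for this illustration.

```python
from math import comb

def p_value_at_least(wins, games, p_win=0.5):
    """Probability of `wins` or more wins in `games` independent rounds."""
    return sum(
        comb(games, k) * p_win**k * (1 - p_win)**(games - k)
        for k in range(wins, games + 1)
    )

# Bob won 20 of 30 rounds; sum the chances of 20, 21, ..., 30 wins.
print(round(p_value_at_least(20, 30), 3))  # 0.049
```

Note that the sum runs over outcomes that never happened (21 wins, 22 wins, and so on), which is exactly the feature of the p value criticized below.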

If this p value is low enough—typically, below 0.05—you reject the null hypothesis. (For the dangers of a magic value like 0.05, see the post “Is ‘Statistically Significant’ Really Significant?”) Here, because 0.049 is (barely!) less than 0.05, we reject the null hypothesis of fair play, and conclude that Bob probably cheated.

Although the procedure is mathematically well defined, it is nonsense:

1. To calculate the probability of Bob’s winning at least 20 games, you use not only the data that you saw (that Bob won 20 games out of 30) but also the imaginary data of Bob winning 21 out of 30 games, or 22 out of 30 games, all the way up to Bob winning all 30 games. The idea of imaginary data is oxymoronic.

2. You compute the probability of the data (the game results), rather than the probability that you want: the probability of the hypothesis (that Bob cheated).

3. By using only the null hypothesis, you ignore information about the alternative hypotheses. For example, if Bob is 3 years old, you need far more decisive data to conclude that he cheated than you need if he is 15 years old and just back from reform school.

In short, a result can be statistically significant, yet meaningless. In my rare cynical moments, I see this problem as explaining why popular health advice flips every 5 or 10 years: for example, from “eat saturated fat” to “eat monounsaturated fats” to “eat no fat” to “Atkins diet.” The original results were statistically significant, but probably meaningless. After 10 years, researchers collect enough grant money to conduct another large-enough study. If it confirms the old advice, no one hears about it (“dog bites man”). If it overturns the old advice (“man bites dog”), publish and report away!

Fortunately, there is a sound alternative: Use Bayes theorem. It solves the three problems with traditional significance testing:

1. The calculation using Bayes theorem does not depend on imaginary data. It considers only the actual data of Bob’s winning 20 games out of 30.

2. Bayes theorem directly gives you the probability of cheating given the data (instead of the probability of the data, given fair play). Here, Bayes theorem tells you that, except for contorted alternative hypotheses, the data increases the odds on cheating by at most a factor of 5.5. So, even if you give your initial probability that Bob cheated as 50-50, the new probability, after incorporating the 20-10 game results, is at most 85 percent. This probability is much less than the 95 percent often, and incorrectly, assumed to result after “rejecting the null hypothesis at the 0.05 level” (as happened here).

3. Bayes theorem considers the alternative hypotheses. Although specifying the alternative hypotheses (and your beliefs in them) requires more thought than evaluating statistical significance, drawing sensible conclusions requires this thinking anyway. Bayes theorem merely forces you to think correctly.

Using p values, by contrast, lets you avoid thinking and produce nonsense. Because p values and statistical significance underlie almost all medical research, society is basing life-and-death decisions upon mathematical quicksand.
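The factor of 5.5 and the 85 percent figure can be reproduced in a few lines. This is only a sketch under stated assumptions: cheating is modeled as raising Bob's chance of winning each game to 2/3 (the value that best fits a 20-10 result), fair play means a chance of 1/2, and the prior probability of cheating is 0.5. The function names are illustrative.

```python
def likelihood_ratio(wins, losses, p_cheat, p_fair=0.5):
    """Prob(data | cheating) / Prob(data | fair play).

    The binomial coefficient is identical in both terms, so it cancels.
    """
    numerator = p_cheat**wins * (1 - p_cheat)**losses
    denominator = p_fair**wins * (1 - p_fair)**losses
    return numerator / denominator

def posterior_probability(prior_probability, ratio):
    """Odds form of Bayes theorem: posterior odds = ratio * prior odds."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = ratio * prior_odds
    return posterior_odds / (1 + posterior_odds)

ratio = likelihood_ratio(20, 10, p_cheat=2/3)
print(round(ratio, 1))                              # 5.5
print(round(posterior_probability(0.5, ratio), 2))  # 0.85
```

Under the same assumed model, a more skeptical prior changes the answer considerably: `posterior_probability(0.1, ratio)` comes out at about 0.38.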

1. Steven Goodman’s paper “A Dirty Dozen: Twelve P-Value Misconceptions,” from the “Further reading” of Wikipedia’s quite good article on p values. The analysis of the misconceptions is very helpful, although the second equation on page 139 (authors, please number your equations!), a.k.a. page 5 of the PDF file, is a misprint. The factor “posterior odds(H_0, given the data)” should read “prior odds(H_0)”.

2. Edwin Jaynes’s paper “Confidence intervals vs Bayesian intervals” (for the mathematically prepared!) and magnum opus Probability Theory: The Logic of Science (CUP, 2003).

1. Marc Orlitzky says:

Yes, statistical significance testing is a big problem in all social sciences, as recently summarized in an article in the journal _Organizational Research Methods_:
http://marcorlitzky.webs.com/Papers/orlitzky2012orm.pdf

Several alternatives to NHST exist…but their implementation won’t be easy or fast. Any deinstitutionalization of a firmly institutionalized practice takes time.

For more background on the genesis of the institution of NHST:

• Sanjoy Mahajan says:

Thank you for those excellent references.

2. Alex says:

Is Bayes’ Theorem not statistical? If I run a Bayesian analysis and conclude Bob cheated, should I not say I have ‘statistically significant’ evidence? Had the Census Bureau run a Bayesian analysis with ridiculous priors, would that have made their press release better?

• Sanjoy Mahajan says:

If it looks like Bob cheated, instead of saying “statistically significant,” I would say, “I have convincing evidence.” If you’ve got an inference method that gives you what you want (the probability of the hypothesis), flaunt it by stating your conclusion in plain words!

3. Seth says:

From xkcd: http://xkcd.com/882/

Basically, if you ask enough questions you’re going to get false positives.

4. Seminymous Coward says:

Probability distributions are not oxymoronic.

Science isn’t supposed to care that you think Bob is likely to cheat or not. Maybe your opinion of the moral fortitude of 5-year-olds is wrong; it certainly shouldn’t change your answer.

Publication bias has nothing to do with choice of significance tests. People will find Bayesian null results just as unexciting.

That Bayesian methods give a different answer for your hypothetical example is not evidence of their superiority. It’d only be even a mere supportive anecdote if it were a real example.

I’m barely capable of conceiving of the arrogance involved in calling frequentist inference “mathematical quicksand” and dismissing the work of so many people, many of them brilliant, so casually.

Yeah, Bayesian probability sure is neat, though. I just never expected to get trolled via it. You collect a new data point every day, I guess.

• Marc Orlitzky says:

Yes, psychologically, “people will find Bayesian null results just as unexciting.”

But, substantively, null values in Bayesian analysis *are* in fact more meaningful (and therefore “exciting”) than null values in the traditional frequentist methods. This will be shown in a forthcoming _Organizational Research Methods_ article by
John K. Kruschke, Department of Psychological and Brain Sciences, Indiana University,
Herman Aguinis and Harry Joo, Department of Management and Entrepreneurship, Kelley

Other “objections” against Bayesian methods will be addressed in the same paper titled “The Time Has Come: Bayesian Methods for Data Analysis in the Organizational Sciences.”

Of course, what no paper can ultimately address is the psychological defensiveness of objectivist-frequentists against the alleged “subjectivism” or “arbitrariness” of Bayesian methodology.

• Seminymous Coward says:

Publication bias is purely psychological, though. Null results from frequentist tests are perfectly legitimate information. At bare minimum, they have clear value as time-savers for study designers. People just don’t _like_ them.

You’re clearly a subject matter expert on frequentist vs. Bayesian methods, so I’ll defer to your other claims. For the record, I don’t have anything against Bayesian methods; I just didn’t like this article’s arguments.

• draypresct says:

Does your paper address the issue of decision-making? The last lecture I attended on this subject, the lecturer (Ken Rothman, a Bayesian) claimed he never made a type I or a type II error because he never concluded one way or the other whether there was an effect. He dismissed frequentist objections to this position as simply not understanding decision theory.

While I admit that “p<0.05” can be, and often has been, misused (much like every other statistic, including the mean), it’s still a valuable tool. I have more of a problem with never making a decision than I do with any particular decision-making methodology.

• Sanjoy Mahajan says:

Sometimes being brilliant is a handicap, because you can go in the wrong direction very fast and far.

Or you can derive results using very painful methods, which less brilliant people can only do by using much better methods. An example is the theory of sequential sampling, where you collect data one sample at a time (think of testing parts coming off an assembly line to decide whether the manufacturing operation is operating correctly). A stopping rule like that is very hard to handle with frequentist statistics, where the stopping rule affects the sampling distribution and therefore the test of statistical significance. You can do it, and Abraham Wald did it, but it is very painful.

And you get the same results with a few lines of Bayesian inference.

5. KevinH says:

While I agree that the standard statistical analyses used by many are problematic, Bayes is full of other potential pitfalls, which your example clearly demonstrates but which you neglect to focus on. In general, Bayes tends to invoke spurious specificity.

First, you need to have an exact model of what cheating will do to the data. For example, when you say “Bayes theorem tells you that, except for contorted alternative hypotheses, the data increases the odds on cheating by at most a factor of 5.5,” you don’t actually mean that Bayes theorem tells you this, but that your specific model of cheating does, after utilizing Bayes theorem. If your model of ‘cheating’ is that it increases the odds of winning to 2/3rds, well then of course it is a good fit to data in which we have a 2/3rd win rate. If we however think that ‘cheating’ can increase your odds of winning to 60/40, or 90/10, then this means the data will not fit as well.
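To see how strongly the factor depends on the assumed model of cheating, here is a small Python sketch comparing the three win rates mentioned above. It assumes fair play means a 1/2 chance per game and that each game is independent; the function name is illustrative.

```python
def likelihood_ratio(wins, losses, p_cheat, p_fair=0.5):
    """Prob(data | cheating model) / Prob(data | fair play)."""
    numerator = p_cheat**wins * (1 - p_cheat)**losses
    denominator = p_fair**wins * (1 - p_fair)**losses
    return numerator / denominator

# Same 20-10 data, three different assumptions about what cheating does.
for p_cheat in (0.6, 2/3, 0.9):
    ratio = likelihood_ratio(20, 10, p_cheat)
    print(f"cheating win rate {p_cheat:.2f}: ratio {ratio:.3f}")
```

The ratios come out at roughly 4.1, 5.5, and 0.013: a 90/10 model of cheating is so badly contradicted by a 20-10 result that the same data count as evidence for fair play.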

There is also of course the issue of priors that you allude to. “if Bob is 3 years old, you need far more decisive data to conclude that he cheated than you need if he is 15 years old and just back from reform school.” While Bayes can account for this, it also requires you to account for it accurately. If you assume that a 15 year old is 10 times as likely to cheat, when in fact they are only 4 times as likely to cheat, your results will be wrong.

In general, Bayes is a double-edged sword. It forces you to make concrete predictions about prior probabilities and models which relate the data and the hypotheses. If those assumptions are correct, then Bayes is an extremely powerful tool. However, if they are incorrect, application of Bayes theorem can lead to severely distorted reasoning and analysis.

6. Kevin says:

And yet to use Bayes theorem all we require is the (imaginary) completely correct specification for the conditional distribution of the data. If the distribution is not correct, then the inference made is incorrect irrespective of the sample size.

7. Mark says:

“Because p values and statistical significance underlie almost all medical research, society is basing life-and-death decisions upon mathematical quicksand.”
This statement as written is vague. If we replace ‘medical research’ with ‘standard of care’ and ‘society’ with ‘health care professionals’, then the statement above is incorrect. The statistical rationale and method required to provide evidence that the ‘standard of care’ does not lead to life threatening injury is far more rigorous than mere tests of significance at a 95% confidence level. One of the main reasons why ‘medical research’ takes so long to enter the realm of ‘standard of care’ is that it is very difficult to prove the cause and effect when the subject is not confined to a laboratory, consuming a controlled diet and performing prescribed activities like lab rodents are.

8. Kendal says:

Mr. Mahajan,
This post seems to include some good arguments in favor of Bayesian statistics and against “frequentist” or classical statistics. I agree that there are valid criticisms, but I’m not sure that either is superior. They each involve a different way of looking at probability, so I’m not sure that comparisons are always appropriate.

For your example, you indicate that you use a prior of 50-50 for Bob’s probability of cheating. But we don’t have any data on whether he cheated; we only have data on whether he won. So I’m not sure how you used the data to update this prior. Or did you actually use a prior of 50-50 for Bob’s probability of winning? If so, then it seems to me that the final Bayesian estimate relates to Bob’s probability of winning and not Bob’s probability of cheating (two very different things).

• James says:

Having just read the poker post, I have to wonder whether Bob is cheating, or just playing more skillfully?

• James says:

As for instance back in the days when (Thorne, was it?) did the first statistical analysis of blackjack, and developed card-counting techniques. I made a good bit of money (at least for me in those days) using them. I’m sure the casinos would have thought I was cheating, but I thought it was my superior skill.

• Sanjoy Mahajan says:

Sorry, I explained it unclearly. I used the odds form of Bayes theorem:

Odds(hypothesis|data) = likelihood ratio * Odds(hypothesis)

The odds are the ratio prob/(1-prob). The likelihood ratio is

Prob(data|hypothesis) / Prob(data|opposite hypothesis).

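Let’s say that the hypothesis is “Bob cheated.” Then the opposite hypothesis is fair play, and the denominator is easy to calculate: It’s the probability of 20 wins out of 30, given each player has the same chance of winning a game. That’s (1/2)^30. (Strictly, that is the probability of one particular sequence of wins and losses; the number of such sequences multiplies the numerator and the denominator equally, so it cancels in the ratio.)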

But the numerator is harder, because you have to specify what the effect of cheating is. To make the comparison as favorable to frequentist statistics as possible, I chose the hypothesis that gives the largest likelihood ratio, namely the hypothesis that cheating gives you a probability of 2/3 of winning a game. Then the probability of 20 wins out of 30 is (2/3)^20 * (1/3)^10. The ratio is 5.5.

Now it’s time to choose the prior odds. Again, to make the comparison as favorable to frequentist statistics as possible, I gave a quite high value to the prior probability of Bob’s having cheated, taking it to be 0.5 (and I hope it’s less with my kids). That gives 1-1 odds, or just simply 1. The final odds are then 5.5 at most, making a final (posterior) probability of about 0.85.

• Kendal says:

Sanjoy,
Thanks for your quick response. I think I understand your example now. I am a little concerned with one thing, however. In your Bayesian approach, I believe you are specifying two states of the world: If Bob is not a cheater, he has a winning probability of 0.5. If he is a cheater, his winning probability is 0.67. I understand why you chose this value, but there are of course more possibilities. Depending on his skill in deception, a cheating Bob might have a winning probability anywhere between 0.51 and 1.0. The frequentist approach takes this into account. The alternative hypothesis simply states that Bob’s winning probability is greater than 0.5.

Another option would be to use a Bayesian approach in which the prior distribution is continuous. That is, we allow for the possibility that Bob’s play (fair or not) could result in any winning probability between 0 and 1. We would weight these different probabilities according to our prior knowledge of Bob (this is where his age and other factors would come into play). Then we obtain data (20 wins in 30 games) to update our prior distribution, and we produce a posterior distribution on Bob’s winning probability. Instead of obtaining one value for probability, we would have a distribution of values which we could analyze. Final results would depend on our prior. But I believe that if our prior incorporates a reasonable suspicion that Bob is a successful cheater, our final analysis of the posterior distribution will actually be more condemning of Bob than the frequentist answer. That is, I’m guessing we would arrive at a very high probability that Bob cheats after we slice the posterior distribution. Let me know what you think.
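A rough Python sketch of this continuous-prior idea follows. It assumes a flat prior on Bob’s winning probability, chosen here purely for illustration (a prior reflecting real knowledge of Bob would look different), and it approximates the posterior on a grid. The function names are hypothetical.

```python
def posterior_on_grid(wins, losses, prior, steps=10_000):
    """Grid approximation of the posterior over the win probability q."""
    width = 1 / steps
    grid = [(i + 0.5) * width for i in range(steps)]  # cell midpoints in (0, 1)
    weights = [prior(q) * q**wins * (1 - q)**losses for q in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]

def flat_prior(q):
    """Assumed prior: every winning probability equally plausible."""
    return 1.0

grid, posterior = posterior_on_grid(20, 10, flat_prior)

# "Slice" the posterior: how much of it lies above a 50% win rate?
prob_above_half = sum(p for q, p in zip(grid, posterior) if q > 0.5)
print(round(prob_above_half, 2))
```

With the flat prior this comes out at about 0.96. That number is the probability that Bob’s winning probability exceeds 1/2, which is not quite the probability that he cheated: a continuous prior puts no weight on a win rate of exactly 1/2, so fair play is effectively ruled out before any data arrive.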

• Seminymous Coward says:

You’ve probably realized this by now, but someone else seems confused and Sanjoy didn’t mention it in his reply. The card game in the example, war, is skill-free and, in fact, choice-free. http://en.wikipedia.org/wiki/War_(card_game) That obviously implies that a 50% chance of winning is correct for fair play regardless of the players.
