# Beware the Weasel Word "Statistical" in Statistical Significance!

As Justin Wolfers pointed out in his post on income inequality last week, the Census Bureau was talking statistical nonsense. I blame the whole idea of statistical significance. For its weasel adjective “statistical” concedes that the significance might not be the kind about which you care. Here, I’ll explain what statistical significance is, and how its use is harmful to society.

To evaluate the statistical significance of an effect, you calculate the so-called p value; if the p value is small enough, the effect is declared statistically significant. For an example to illustrate the calculations, imagine that your two children Alice and Bob play 30 rounds of the card game “War,” and that the results are 20-10 in favor of Bob. Was he cheating?

To calculate the p value, you need an assumption, called the null (or no-effect) hypothesis: here, that the game results are due to chance (i.e. no cheating). The p value is the probability of getting results at least as extreme as the actual results of 20-10. Here, the probability of Bob’s winning at least 20 games is 0.049. (Try it out at Daniel Sloper’s “Cumulative Binomial Probability Calculator.”)

If this p value is low enough—typically, below 0.05—you reject the null hypothesis. (For the dangers of a magic value like 0.05, see the post “Is ‘Statistically Significant’ Really Significant?”) Here, because 0.049 is (barely!) less than 0.05, we reject the null hypothesis of fair play, and conclude that Bob probably cheated.
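The tail probability used above can be reproduced in a few lines. This is a minimal sketch, assuming the null hypothesis means each round is an independent fair coin flip (win probability 0.5); the function name `binomial_tail` is illustrative and not from any particular calculator.

```python
from math import comb

def binomial_tail(games, wins, win_prob=0.5):
    """Probability of at least `wins` successes in `games` independent rounds."""
    return sum(
        comb(games, k) * win_prob**k * (1 - win_prob) ** (games - k)
        for k in range(wins, games + 1)
    )

# Null hypothesis: fair play, so Bob wins each round with probability 0.5.
p_value = binomial_tail(games=30, wins=20)
print(f"P(Bob wins at least 20 of 30 | fair play) = {p_value:.3f}")
```

Note that the sum runs over 20, 21, ..., 30 wins, which is why the p value depends on outcomes that were never observed.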

Although the procedure is mathematically well defined, it is nonsense:

1. To calculate the probability of Bob’s winning at least 20 games, you use not only the data that you saw (that Bob won 20 games out of 30) but also the imaginary data of Bob winning 21 out of 30 games, or 22 out of 30 games, all the way up to Bob winning all 30 games. The idea of imaginary data is oxymoronic.

2. You compute the probability of the data (the game results), rather than the probability that you want: the probability of the hypothesis (that Bob cheated).

3. By using only the null hypothesis, you ignore information about the alternative hypotheses. For example, if Bob is 3 years old, you need far more decisive data to conclude that he cheated than you need if he is 15 years old and just back from reform school.

In short, a result can be statistically significant, yet meaningless. In my rare cynical moments, I see this problem as explaining why popular health advice flips every 5 or 10 years: for example, from “eat saturated fat” to “eat monounsaturated fats” to “eat no fat” to “Atkins diet.” The original results were statistically significant, but probably meaningless. After 10 years, researchers collect enough grant money to conduct another large-enough study. If it confirms the old advice, no one hears about it (“dog bites man”). If it overturns the old advice (“man bites dog”), publish and report away!

Fortunately, there is a sound alternative: use Bayes’ theorem. It solves the three problems with traditional significance testing:

1. The calculation using Bayes’ theorem does not depend on imaginary data. It considers only the actual data of Bob’s winning 20 games out of 30.

2. Bayes’ theorem directly gives you the probability of cheating given the data (instead of the probability of the data, given fair play). Here, Bayes’ theorem tells you that, except for contorted alternative hypotheses, the data increase the odds on cheating by at most a factor of 5.5. So, even if you give your initial probability that Bob cheated as 50-50, the new probability, after incorporating the 20-10 game results, is at most 85 percent. This probability is much less than the 95 percent often, and incorrectly, assumed to result after “rejecting the null hypothesis at the 0.05 level” (as happened here).

3. Bayes’ theorem considers the alternative hypotheses. Although specifying the alternative hypotheses (and your beliefs in them) requires more thought than evaluating statistical significance, drawing sensible conclusions requires this thinking anyway. Bayes’ theorem merely forces you to think correctly.

Using p values, by contrast, lets you avoid thinking and produce nonsense. Because p values and statistical significance underlie almost all medical research, society is basing life-and-death decisions upon mathematical quicksand.
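One way to arrive at the factor of 5.5 and the 85 percent figure is sketched below. It rests on an assumption the post does not spell out: that the alternative hypothesis most favorable to cheating is a fixed per-round win probability equal to the observed rate, 20/30. The odds form of Bayes’ theorem then multiplies the prior odds by the likelihood ratio.

```python
def likelihood(win_prob, wins=20, games=30):
    """Likelihood of the observed record, up to the binomial coefficient
    (which is the same for every hypothesis and cancels in the ratio)."""
    return win_prob**wins * (1 - win_prob) ** (games - wins)

# Assumed best-case cheating model: win probability equal to the observed 2/3.
bayes_factor = likelihood(2 / 3) / likelihood(1 / 2)

prior_odds = 1.0  # the 50-50 prior used in the post
posterior_odds = prior_odds * bayes_factor
posterior_prob = posterior_odds / (1 + posterior_odds)

print(f"Bayes factor (cheating vs. fair play): {bayes_factor:.2f}")
print(f"Posterior probability of cheating:     {posterior_prob:.1%}")
```

Any other fixed win probability for the cheating model gives a smaller likelihood ratio, which is why these numbers are upper bounds rather than estimates.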

1. Steven Goodman’s paper “A Dirty Dozen: Twelve P-Value Misconceptions,” from the “Further reading” of Wikipedia’s quite good article on p values. The analysis of the misconceptions is very helpful, although the second equation on page 139 (authors, please number your equations!), a.k.a. page 5 of the PDF file, is a misprint. The factor “posterior odds(H_0, given the data)” should read “prior odds(H_0)”.

2. Edwin Jaynes’s paper Confidence intervals vs Bayesian intervals (for the mathematically prepared!) and magnum opus Probability Theory: The Logic of Science (CUP, 2003).

#### Marc Orlitzky

Yes, statistical significance testing is a big problem in all social sciences, as recently summarized in an article in the journal _Organizational Research Methods_:
http://marcorlitzky.webs.com/Papers/orlitzky2012orm.pdf

Several alternatives to NHST exist...but their implementation won't be easy or fast. Any deinstitutionalization of a firmly institutionalized practice takes time.

For more background on the genesis of the institution of NHST:

#### Alex

Is Bayes' Theorem not statistical? If I run a Bayesian analysis and conclude Bob cheated, should I not say I have 'statistically significant' evidence? Had the Census Bureau run a Bayesian analysis with ridiculous priors, would that have made their press release better?

#### Seth

From xkcd: http://xkcd.com/882/

Basically, if you ask enough questions you're going to get false positives.
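A rough sketch of that arithmetic, assuming every test is independent and every null hypothesis is actually true (the count of 20 tests is only an illustrative choice):

```python
def prob_any_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of `num_tests` independent tests of true
    null hypotheses comes out 'significant' at level `alpha`."""
    return 1 - (1 - alpha) ** num_tests

family_rate = prob_any_false_positive(20)
print(f"At least one false positive in 20 tests: {family_rate:.0%}")
```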

#### Seminymous Coward

Probability distributions are not oxymoronic.

Science isn't supposed to care that you think Bob is likely to cheat or not. Maybe your opinion of the moral fortitude of 5-year-olds is wrong; it certainly shouldn't change your answer.

Publication bias has nothing to do with choice of significance tests. People will find Bayesian null results just as unexciting.

That Bayesian methods give a different answer for your hypothetical example is not evidence of their superiority. It'd only be even a mere supportive anecdote if it were a real example.

I'm barely capable of conceiving of the arrogance involved in calling frequentist inference "mathematical quicksand" and dismissing the work of so many people, many of them brilliant, so casually.

Yeah, Bayesian probability sure is neat, though. I just never expected to get trolled via it. You collect a new data point every day, I guess.

#### Marc Orlitzky

Yes, psychologically, "people will find Bayesian null results just as unexciting."

But, substantively, null values in Bayesian analysis *are* in fact more meaningful (and therefore "exciting") than null values in the traditional frequentist methods. This will be shown in a forthcoming _Organizational Research Methods_ article by
John K. Kruschke, Department of Psychological and Brain Sciences, Indiana University,
Herman Aguinis and Harry Joo, Department of Management and Entrepreneurship, Kelley

Other "objections" against Bayesian methods will be addressed in the same paper titled "The Time Has Come: Bayesian Methods for Data Analysis in the Organizational Sciences."

Of course, what no paper can ultimately address is the psychological defensiveness of objectivist-frequentists against the alleged "subjectivism" or "arbitrariness" of Bayesian methodology.

#### Seminymous Coward

Publication bias is purely psychological, though. Null results from frequentist tests are perfectly legitimate information. At bare minimum, they have clear value as time-savers for study designers. People just don't _like_ them.

You're clearly a subject matter expert on frequentist vs. Bayesian methods, so I'll defer to your other claims. For the record, I don't have anything against Bayesian methods; I just didn't like this article's arguments.

#### KevinH

While I agree that the standard statistical analyses used by many are problematic, Bayes is full of other potential pitfalls, which your example clearly demonstrates but you neglect to focus on. In general, Bayes tends to invoke spurious specificity.

First, you need to have an exact model of what cheating will do to the data. For example, when you say "Bayes theorem tells you that, except for contorted alternative hypotheses, the data increases the odds on cheating by at most a factor of 5.5", you don't actually mean that Bayes' theorem tells you this, but that your specific model of cheating does, after utilizing Bayes' theorem. If your model of 'cheating' is that it increases the odds of winning to 2/3rds, well then of course it is a good fit to data in which we have a 2/3rds win rate. If, however, we think that 'cheating' can increase your odds of winning to 60/40, or 90/10, then the data will not fit as well.
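The dependence on the cheating model can be made concrete with a small sketch. It assumes each cheating model is a fixed per-round win probability for Bob, and compares each one against fair play (0.5) on the observed 20-10 record; the function name is illustrative.

```python
def likelihood_ratio(cheat_win_prob, wins=20, games=30, fair_win_prob=0.5):
    """How much more likely the observed record is under a given cheating
    model than under fair play."""
    losses = games - wins
    cheat = cheat_win_prob**wins * (1 - cheat_win_prob) ** losses
    fair = fair_win_prob**wins * (1 - fair_win_prob) ** losses
    return cheat / fair

# Three candidate models of what cheating does to Bob's win probability.
ratios = {p: likelihood_ratio(p) for p in (0.6, 2 / 3, 0.9)}
for win_prob, ratio in ratios.items():
    print(f"cheating win probability {win_prob:.2f}: likelihood ratio {ratio:.3f}")
```

Under the 90/10 model the ratio falls below 1, meaning the same 20-10 record counts as evidence for fair play rather than against it.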

There is also of course the issue of priors that you allude to. "if Bob is 3 years old, you need far more decisive data to conclude that he cheated than you need if he is 15 years old and just back from reform school." While Bayes can account for this, it also requires you to account for it accurately. If you assume that a 15 year old is 10 times as likely to cheat, when in fact they are only 4 times as likely to cheat, your results will be wrong.

In general, Bayes is a double-edged sword. It forces you to make concrete predictions about prior probabilities and models which relate the data and the hypotheses. If those assumptions are correct, then Bayes is an extremely powerful tool. However, if they are incorrect, application of Bayes' theorem can lead to severely distorted reasoning and analysis.

#### Kevin

And yet to use Bayes theorem all we require is the (imaginary) completely correct specification for the conditional distribution of the data. If the distribution is not correct, then the inference made is incorrect irrespective of the sample size.

#### Mark

"Because p values and statistical significance underlie almost all medical research, society is basing life-and-death decisions upon mathematical quicksand."
This statement as written is vague. If we replace 'medical research' with 'standard of care' and 'society' with 'health care professionals', then the statement above is incorrect. The statistical rationale and method required to provide evidence that the 'standard of care' does not lead to life threatening injury is far more rigorous than mere tests of significance at a 95% confidence level. One of the main reasons why 'medical research' takes so long to enter the realm of 'standard of care' is that it is very difficult to prove the cause and effect when the subject is not confined to a laboratory, consuming a controlled diet and performing prescribed activities like lab rodents are.

#### Kendal

Mr. Mahajan,
This post seems to include some good arguments in favor of Bayesian statistics and against "frequentist" or classical statistics. I agree that there are valid criticisms, but I'm not sure that either is superior. They each involve a different way of looking at probability, so I'm not sure that comparisons are always appropriate.

For your example, you indicate that you use a prior of 50-50 for Bob's probability of cheating. But we don't have any data on whether he cheated; we only have data on whether he won. So I'm not sure how you used the data to update this prior. Or did you actually use a prior of 50-50 for Bob's probability of winning? If so, then it seems to me that the final Bayesian estimate relates to Bob's probability of winning and not Bob's probability of cheating (two very different things).

#### James

Having just read the poker post, I have to wonder whether Bob is cheating, or just playing more skillfully?

#### James

As for instance back in the days when (Thorne, was it?) did the first statistical analysis of blackjack, and developed card-counting techniques. I made a good bit of money (at least for me in those days) using them. I'm sure the casinos would have thought I was cheating, but I thought it was my superior skill.

#### zepplin

Bayesian analysis is not without its downside, and the frequentist approach is not as unreasonable as you make it out to be.

It is not the method of inference that is causing these poor papers, but the quality of the researchers themselves and their poor understanding/application of statistics. Anyone who can pick a poor null hypothesis can pick an equally poor Bayesian prior.

In your particular example, the null hypothesis of "did not cheat" vs "cheated" is not tested correctly. The test only compares "equal" vs "unequal" chance of winning. If it is indeed the case that the null hypothesis "did not cheat" allows for unequal chance of winning in the event of skill difference, then the calculation of the type I error is incorrect. You would have to account for that probability in some arbitrary way to water down the result (much like your arbitrary assignment of a 50-50 prior). In fact, any frequentist result can be matched by picking a specific prior.

The biggest problems are outside the Bayesian vs Frequentist debate altogether. In your particular example, simply seeing a 20:10 game should not lead to an immediate conclusion but to a new experiment of 30 games. For researchers, the selection bias of both only looking at and only publishing interesting results will always be the most significant problem.

And for journalists and headline writers, the desire to report sensationalist results with little or no nuance will remain the most significant problem. In our example, it is extremely unlikely the peer-reviewed source paper would have said "cheat" vs "not cheat" in the actual conclusions as opposed to perhaps mentioning it in the abstract/background. Similarly, the correlation vs. causation disclaimer is also often "lost in translation".

Unfortunately, there is no sound alternative to meticulously accounting for biases and conducting careful experiments and analysis, whichever statistical approach a researcher decides to use, and there is no sound alternative to readers learning statistics themselves and examining the source paper, whichever statistical approach an article cites.

#### Marc

Actually, the quality and "progress" of science--or the lack thereof--is a systemic or institutional phenomenon (e.g., collectivist knee-jerk reactions against subjectivism, inductivism, insistence on questionable methodological orthodoxy by the gatekeepers of the science, etc.). Most research streams in the social sciences have therefore become ideological...and this problem has almost nothing to do with the study-specific decisions of *individual researchers*:

http://marcorlitzky.webs.com/Papers/orlitzky2012orm.pdf

The empirical evidence on the extent to which frequentist methods are misapplied should make this clear.

#### rtp10

Basing conclusions (especially causal) off only 1 study is stupid; it doesn't matter if it's frequentist or Bayesian. Replication of results & generalizing the results to other settings is far more important than the findings of 1 study's significance.

#### crquack

There is nothing wrong with using the p-value to determine if a clinical trial result is "statistically" significant or not. The clinical significance of the said result, however, is another matter.

You can show e.g. that if you treat 100,000 people with 'A' you will get 1 negative event (however you choose to define it) and if you treat 100,000 people with placebo you will get 3 negative events. The p value will be less than 0.05, this result is significant (I did not do the math but given the n-number I would be surprised if it was not). If you work for a company that makes 'A' or if you are a publication hound out to make your name you will claim 66.7% reduction in negative events, the sales of 'A' will go up and you will be hailed as a father of a new treatment. In reality however, the absolute reduction of negative events is from 0.003% to 0.001%, i.e. 0.002%. To prevent 1 negative event you have to treat 50,000 patients.

Far less impressive, yet policy decisions are made on these types of numbers.
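The figures in this comment can be checked with a short sketch. The relative and absolute reductions and the number needed to treat follow directly from the stated counts. The comment leaves the p value uncomputed, so the sketch also runs a two-sided Fisher exact test, written out by hand as an assumption about which test would be used. With only four events in total, that test gives a p value well above 0.05, so the hypothetical would need larger event counts to be statistically significant; the point about relative versus absolute risk is unaffected.

```python
from math import comb

def fisher_two_sided(events_a, events_b, group_size):
    """Two-sided Fisher exact p value for two equal-sized groups, summing the
    probabilities of all tables no more likely than the observed one."""
    total = events_a + events_b
    weights = [
        comb(group_size, k) * comb(group_size, total - k)
        for k in range(total + 1)
    ]
    observed = weights[events_a]
    return sum(w for w in weights if w <= observed) / comb(2 * group_size, total)

group_size = 100_000
risk_treated = 1 / group_size   # 1 negative event on 'A'
risk_placebo = 3 / group_size   # 3 negative events on placebo

relative_reduction = (risk_placebo - risk_treated) / risk_placebo
absolute_reduction = risk_placebo - risk_treated
number_needed_to_treat = 1 / absolute_reduction
p_value = fisher_two_sided(1, 3, group_size)

print(f"Relative risk reduction: {relative_reduction:.1%}")
print(f"Absolute risk reduction: {absolute_reduction:.3%}")
print(f"Number needed to treat:  {number_needed_to_treat:,.0f}")
print(f"Fisher exact p value:    {p_value:.3f}")
```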

#### Alan T

Sanjoy,

Thanks for an enlightening post!

In order to use Bayes' Theorem, you need the probability of 20 wins, given cheating. How do you estimate this?

#### Shane L

Deirdre McCloskey and Stephen Ziliak have an interesting book on the use and misuse of statistical significance: "The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives". They make quite a passionate argument that significance is misused regularly in much science.
http://www.amazon.com/The-Cult-Statistical-Significance-Economics/dp/0472050079

#### Robert Fornango

Nice post, Sanjoy! You raise an excellent issue that I have always tried to hammer home in my stats courses: statistical significance versus substantive difference.

I'm a bit of a mix when it comes to Bayesian and frequentist methods, and there are many reasons to recommend Bayes' theorem. But I disagree that p-values are inherently nonsense. The logic of a traditional frequentist approach is straightforward, but dependent on clearly spelled out assumptions (i.e. the sampling distribution or data-generating process). At that point, it is incumbent on the researcher to validate the assumptions of the method, AND (importantly) make sure the rest of their research methodology is adequate.

When it comes to medical research, a large portion of the confusion in results that you bring up is likely due to problems with other aspects of the research, rather than the specific statistical technique used. Most importantly, many large scale studies are based on randomized controlled experiments. While such techniques are the gold standard, they are not fool-proof. And the failure to account for other possible alternative hypotheses (e.g. complex interaction effects and non-linear relationships) makes it difficult to know how "real" the results actually are.

Still, I enjoyed the article very much...I'm sure this is a debate that will continue on for some time...not to be resolved here.