Is "Statistically Significant" Really Significant?

A new paper by psychologists E.J. Masicampo and Daniel Lalande finds that an uncanny number of psychology results just barely qualify as statistically significant. From the abstract:

We examined a large subset of papers from three highly regarded journals. Distributions of p were found to be similar across the different journals. Moreover, p values were much more common immediately below .05 than would be expected based on the number of p values occurring in other ranges. This prevalence of p values just below the arbitrary criterion for significance was observed in all three journals.
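
To see what that pile-up would look like, here is a rough Python sketch of the kind of check the abstract describes, using made-up p-values and a simple binning of my own choosing, not the authors' actual data or method:

import numpy as np

# Hypothetical "reported" p-values, invented for illustration: a uniform-ish
# background plus a suspicious pile-up just below .05.
rng = np.random.default_rng(0)
p_values = np.concatenate([
    rng.uniform(0.0, 0.10, size=500),
    rng.uniform(0.045, 0.050, size=60),
])

# Count how many p-values fall in each 0.005-wide bin between 0 and 0.10.
bins = np.arange(0.0, 0.105, 0.005)
counts, _ = np.histogram(p_values, bins=bins)

for lo, hi, n in zip(bins[:-1], bins[1:], counts):
    flag = "  <-- just below .05" if np.isclose(hi, 0.05) else ""
    print(f"{lo:.3f}-{hi:.3f}: {n}{flag}")

If the .045-.050 bin stands well above its neighbors, something other than chance is probably shaping which results get reported.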

The BPS Research Digest explains the likely causes:

The pattern of results could be indicative of dubious research practices, in which researchers nudge their results towards significance, for example by excluding troublesome outliers or adding new participants. Or it could reflect a selective publication bias in the discipline – an obsession with reporting results that have the magic stamp of statistical significance. Most likely it reflects a combination of both these influences. 

“[T]he field may benefit from practices aimed at counteracting the single-minded drive toward achieving statistical significance,” say Masicampo and Lalande.


Andy W

I was recently pointed to a similar paper on SSRN, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2089580, examining p-values reported in economics journals.

Star Wars: The Empirics Strike Back (Brodeur et al., 2012)

Joe

This has been a problem for years. Any decent grad student in the hard sciences will have a stack of papers about Type I and II errors, the file drawer problem, publication bias, and the various other names given to this group of issues.

In my field there has been an effort to move away from searching for statistical significance and towards statistical estimates of effect size. The effort moves at a glacial pace, since so many scientists don't want to learn something new to replace what they already don't understand.

The core problem is that most scientists don't understand the statistics they're using and most statisticians don't understand the role of statistics in the sciences.

Submit a paper using an analysis that makes more sense, and at least one reviewer will require you to shoehorn it back into the good ol' significance test.

MeanOnSunday

This is common to all scientific fields. Many of the major medical journals now require that study protocols be registered publicly before the study starts, and that the authors submit a detailed analysis plan with the paper. Some even require the authors to provide the data to an independent third party to verify the results.

An excellent reference for understanding research results and putting them in the proper context is John Ioannidis's "Why Most Published Research Findings Are False":

http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124

GKB

Or it could be a reflection of efficiency. When designing an experiment, the second thing you do after defining your null and alternative hypotheses is to calculate the minimum sample size required to detect a significant difference at an alpha of 0.05. Generally you then add a buffer to account for dropouts, missing data, etc.

If your initial assumptions are accurate and you recruit the minimum number required... your test will tend to come out significant with a p-value just under 0.05. If you get a p-value of 0.0001, it means you either underestimated the effect size or overestimated its variability, and had far more subjects than required to reject the null.
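
For readers who haven't seen that step, here is a minimal sketch of the kind of sample-size calculation GKB describes, assuming a two-sample t-test; the effect size, power, and dropout buffer are placeholder assumptions, not values from any particular study:

from statsmodels.stats.power import TTestIndPower

assumed_effect_size = 0.5   # standardized difference you hope to detect (assumed)
alpha = 0.05                # significance criterion
power = 0.80                # chance of detecting the effect if it is real

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=assumed_effect_size,
                                   alpha=alpha, power=power,
                                   alternative='two-sided')

dropout_buffer = 1.15       # e.g. recruit ~15% extra for dropouts and missing data
print(f"Minimum per group: {n_per_group:.1f}")
print(f"Recruit per group: {n_per_group * dropout_buffer:.0f}")

Under these particular assumptions, this comes out to roughly 64 subjects per group before the buffer.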

Of course, at least in medical research, it is well established that top journals are much more likely to publish a study with significant results as these will lead to a change in practice.

Seminymous Coward

The paper doesn't seem to be public, so I guess I'll ask here. Is there a nifty chart like the one from http://www.freakonomics.com/2011/07/07/another-case-of-teacher-cheating-or-is-it-just-altruism/ that made the problem so transparent?

L Lehmann

Also, even if a finding is legitimately statistically significant, that should not be confused with clinical significance. A finding may be highly statistically significant yet clinically meaningless.
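
A toy illustration of that distinction, with entirely invented numbers: given a large enough sample, even a clinically negligible difference will produce a tiny p-value.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Invented data: two huge groups whose true means differ by a trivial 0.2 units.
control   = rng.normal(loc=120.0, scale=10.0, size=200_000)
treatment = rng.normal(loc=119.8, scale=10.0, size=200_000)

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"Mean difference: {treatment.mean() - control.mean():.2f}")
print(f"p-value: {p_value:.2e}")   # "statistically significant", yet clinically meaningless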

Paul S

The "field" might benefit even more from practices aimed at developing a reasonable level of humility. The trouble is, saying that "in our 'study' we really did not find out much of anything that anyone can rely on" is probably not an effective way to angle for the next grant. So it goes.

Anonymous Coward

Obligatory XKCD link:

http://xkcd.com/882/

alex in chicago

Isn't it quite obvious? The most "groundbreaking" studies will most often be the ones with dubious significance values, because they are finding results that aren't there, or are marginal.

Eric M. Jones.

No discussion of the validity of studies in psychology is complete without reading Richard Feynman's "Cargo Cult Science," his tale of this sort of thing, particularly Young's rat-maze experiment:

Google "Richard Feynman's Caltech commencement address 1974 Young rat maze"

Eimantas

I may be wrong, but it seems to me that most papers get published and cited primarily because they make scientific sense, rather than merely because they produce statistically significant results.

Karel Petrak

Let's not forget that in statistics, "significant" does not mean "important" or "meaningful". It is simply a test indicating how likely it is that the observed result could have occurred by chance.
Depending on the design of a given experiment, "statistical significance" does not tell us whether the result has any practical, "real" importance.
As long as "statistical significance" is a condition for having a paper accepted for publication, it should not come as a surprise that some researchers may be tempted to "massage" their data into a suitable shape...

Ariel Casey

Statistics suck sometimes. I suffered a severe traumatic brain injury last year after being hit by a car. The best possible outcome the neurologists could give my husband was that if I lived, I would be a vegetable in a nursing home for the rest of my life. Nuts to that! But what I've learned scoping the internet for TBI information is that of the 1.7 million people who suffer one each year, 50,000 die; the rest experience changes in mental abilities and personality, amnesia, and long- or short-term problems with independent functioning. Given my proclivity for NOT dying and for working as hard as I can, I got my brain back, no problem. And I only graduated high school! So, does my outlier-ness skew the statistics? Or is it that such a hidden ailment (affecting athletes, soldiers, people hit by cars, kids who don't wear bike helmets, etc.) gets so ignored that there are not enough statistics?


Ian Ker

As I understand it, the reliance on 95% 'statistical significance' originated in biology and the health sciences. It has, however, spread into the social sciences, where (a) the data are often less reliable for non-statistical reasons and (b) the 'requirement' is often used to deny the validity of alternative ways of approaching problems.

We should question whether such statistical 'requirements' (principally 95% confidence) are appropriate for interventions in areas of public policy where failure does not carry unacceptable consequences. In simple terms, should we deny ourselves a 90% chance of achieving what we set out to do because of a 10% chance that we will not?

Karel

One should first define the problem and the task, determine how to measure the outcome, and then quantify it. Statistics simply adds information about the probability that the outcome is, or is not, a random event.
In the case of the social sciences, one could question whether using statistics serves any purpose, since you already know that the result is most likely the outcome of a random sequence of events...