The Dangers of Too Much Data

Wondering whether aspirin will protect your heart or cause internal bleeding? Or whether you should kick your coffee habit or embrace it? It’s often hard to make sense of the conflicting advice that comes out of medical research studies. John Timmer explains that our statistical tools simply haven’t kept up with the massive amounts of data researchers now have access to. In medical (and economic) research, scientists claim a “statistically significant” finding if there’s a less than 5% chance that an observed pattern (between coffee and liver disease, for example) occurred at random. In the new age of data, that rule causes problems: “Even given a low tolerance for error, the sheer number of tests performed ensures that some of them will produce erroneous results at random.” In lay terms, all those new tests you get at the doctor’s office are translated into data sets, which researchers then pore over searching for connections and patterns. And, if you have enough data to examine, eventually you’ll find a statistically significant relationship where no such relationship actually exists — by sheer coincidence. (HT: Matthew Rotkis)
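The point about sheer numbers of tests is easy to demonstrate with a quick simulation: test many data sets that are pure noise, and roughly 5% of them come out "significant" anyway. A minimal sketch in Python (the sample size, test count, and simple z-test are illustrative choices, not from the article):

```python
import random

random.seed(0)

ALPHA = 0.05    # the conventional 5% significance threshold
N_TESTS = 1000  # number of independent hypotheses examined
N = 200         # observations per group

def null_test_is_significant(n):
    """Compare two groups drawn from the SAME distribution.

    Any 'significant' difference here is a false positive by
    construction. Uses a simple two-sided z-test on the means,
    with the population standard deviation (1.0) treated as known.
    """
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    se = (2 / n) ** 0.5           # standard error of the difference
    return abs(diff) / se > 1.96  # critical value for alpha = 0.05

hits = sum(null_test_is_significant(N) for _ in range(N_TESTS))
print(f"{hits} of {N_TESTS} pure-noise tests came out 'significant'")
```

With no real effects anywhere in the data, the count lands near 50 — about 5% of the tests — which is exactly the "erroneous results at random" problem the article describes.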


That is not a problem with an abundance of data, nor is it a problem with our statistical tools. It is a problem with how people inappropriately apply those tools to the abundant data.


Can somebody please explain this to me? To my understanding, the author is saying this: the more data you have, the more likely you'll be able to find a statistically significant relationship.

Let's say that your hypothesis was that people who read Freakonomics are more likely to eat spinach.

Then I agree that the larger your sample size of readers, the more people you will find who read Freakonomics and eat spinach.

But isn't that offset by the large amounts of people who don't read Freakonomics but also eat spinach?

i.e. that P(ppl who eat spinach | they read Freakonomics) won't really change.

Which means that the significance test actually won't be different either?

Or am I missing something?

Mark B.

Dzof, significance tests are computed with the help of the "standard error," a measurement influenced, in part, by the sample size. The larger the sample size, the smaller the standard errors, and the better the chance of finding a statistically significant result (a result that differs from zero).

I disagree with the post's author in that this is not a problem with having lots of data, or doing lots of tests on data. Instead, it's a problem with relying (almost) exclusively on p-values as tests of statistical significance rather than on a combination of p-values and effect sizes.
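The standard-error point above can be seen numerically. For a sample mean, the textbook formula is SE = σ/√n, so quadrupling the sample size halves the standard error, and a fixed tiny effect eventually clears the 1.96 × SE bar (the numbers below are illustrative, not from the post):

```python
import math

sigma = 1.0    # population standard deviation (assumed known)
effect = 0.05  # a fixed, tiny true difference in means (assumed)

for n in (100, 1000, 10000, 100000):
    se = sigma / math.sqrt(n)          # standard error of the mean
    significant = abs(effect) > 1.96 * se  # two-sided test at 5%
    print(f"n = {n:6d}: SE = {se:.4f}, effect significant? {significant}")
# the same 0.05 effect becomes 'significant' once n reaches 10,000
```

This is why, with enough data, even negligible effects register as statistically significant — and why pairing p-values with effect sizes matters.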

Robert Grant


I think the point is that if you looked at the set of people who read Freakonomics, the probability that they will also all eat spinach is very low, but the probability that they all do *something* (own a dog/use Firefox/go jogging three times a week/drink Pepsi) gets higher the more things you compare them against.

I.e. groups will sometimes overlap strongly even with there being no reason for it, if enough groupings are measured. I think it's just the standard correlation != causation thing, restated.



The problem isn't with data sets containing many data points. Those data are, and will always be, the most statistically reliable.

The point of the article is that we are generating a lot of data sets with a modest number of data points. If you have thousands of data sets, at least a few of them will erroneously show a statistical relationship at the 5% significance level by pure chance.


What the author is trying to say is:
When CNN says that a new study shows a link between consuming 5 cups of coffee per day and liver disease, then CNN actually means that the study found a "statistically significant" relationship between consuming 5 cups of coffee per day and liver disease. However, the standard threshold for deciding that something is "statistically significant" is that there is a less than 5% chance the link observed in the study occurred randomly.

So looking at just one study, it is likely (95% chance) that such a link exists. However, if you have lots of studies each year that show some "statistically significant" relationship, then you would expect about 5% of those results to have been actually produced by randomness and not by any actual causal effect. Basically, if the threshold of "statistical significance" is a less than 5% chance of randomness causing the result, then out of 100 studies that show a relationship, you would expect the relationship espoused by the study to be wrong for 5 of the studies.

It is probably not quite as simple as that since some studies will show a much less than 5% chance of randomness causing the relationship, but that's the gist of the post.
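That arithmetic can be pushed one step further: the share of "significant" findings that are flukes depends on how many of the tested hypotheses are true to begin with, and it can be much larger than 5%. A back-of-the-envelope calculation (all four numbers below are assumed purely for illustration):

```python
n_studies = 1000  # hypotheses tested (assumed)
true_rate = 0.10  # fraction of hypotheses that are actually true (assumed)
alpha = 0.05      # false-positive rate when there is no real effect
power = 0.80      # chance a real effect is detected (assumed)

true_hits = n_studies * true_rate * power         # real findings
false_hits = n_studies * (1 - true_rate) * alpha  # flukes

frac_wrong = false_hits / (true_hits + false_hits)
print(f"{false_hits:.0f} of {true_hits + false_hits:.0f} "
      f"'significant' results are false ({frac_wrong:.0%})")
# → 45 of 125 'significant' results are false (36%)
```

Under these assumptions, more than a third of the headline-worthy results are wrong, even though every individual test kept its error rate at 5%.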



@Dzof: It's more like this: if you have data for readers of thousands of books, and what they eat, and you run enough studies, eventually you are going to find a "statistically significant" correlation between readers of one book (say, Freakonomics) and eaters of some food (say, spinach). But it may simply be one of those 5% of cases where the pattern actually did occur at random.


Great article. I'm a graduate student whose thesis work involves both genomic microarray and "wet lab" data. I do lots of work with datasets involving >50,000 probesets for each of 50-100 samples. There is always a balance between a low false discovery rate and setting cutoffs so stringently that real differences are missed.

My advisor requires the following two rules be applied after the statistical algorithm of choice has spit out the top 1000 statistically significant associations.

1) Plot the data graphically and ask if it passes the "eyeball test." You should be able to glance at the graph and immediately see whether one group is dramatically different from the others. This works best if you do a "dot plot," in which every sample/patient/mouse/whatever is represented by a single point, rather than a bar graph that compresses everything to averages.

2) Differences that are real and biologically important do not require p-values, chi squared, FDR, fancy transformations, or any other statistical test to be believable. If the graph is unimpressive and you have to squint to see that maybe group A is different from group B — but golly, the p value is significant — the difference is not likely to be real.

It follows from these two rules that there is no substitute for looking at the data or graphs in the original study. Science reporting would be a lot clearer to the general public if graphical representations were included with articles. Not likely, but I can dream, can't I?



The problem seems to lie not in having too much data, as having less wouldn't make us better informed. Rather it is in the acceptance of the 95% confidence interval for medical analysis.


Reminds me of the bible codes ! If you have enough data you'll find some interesting random pattern !


As more studies are done, the percentage of Type I errors might not change (assuming everyone has been using the same significance level), but in absolute number terms there will be more results out there which are reporting faulty conclusions (Type I errors) because of the larger number of studies.

As another poster mentioned, that doesn't mean we should prefer fewer data sets or fewer studies. It just means that we need to be aware that Type I errors will exist and rigorously test the same hypotheses over time to double check results that are published in the media.


Reminds me of the WSJ's provocative article "Most Science Studies Appear to Be Tainted By Sloppy Analysis," which seems to apply to more than just medicine.


The problems don't end with medicine. Mercurial statistical outcomes are everywhere. I never had a problem with them until politicians started using them to extort money from the befuddled hordes.


This is why you still need qualitative analysis. You need to be able to explain something, not just show a correlation. If you can only show a correlation and have no qualitative basis to use for analysis, you have little other than an impressive graph for a PowerPoint slide.

What is unfortunate is that most people will probably believe the pretty graph on the PowerPoint slide more readily than the wall of text that would make up a qualitative analysis. This is a separate issue, however.

gevin shaw

Many of the reversals in medical advice so widely noticed over the years are a result of sloppy reporting. Correlations statistically significant enough to warrant research are reported as findings in lurid headlines rather than as an announcement of research into a statistically significant correlation. A couple of years later, the findings of the actual research can then appear as a refutation of a previous finding when it's really just an explanation of a previously reported correlation.

I'd imagine researchers, in search of grants and budget increases and contract renewals, contribute to this in their press releases, not lying or deceiving, but playing up the more lurid and exciting aspects of the raw data.

Michael F. Martin

Why was this post not signed, I wonder?


Well, this is why people use things like Bonferroni corrections.
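For readers who haven't met the term: a Bonferroni correction divides the significance threshold by the number of tests performed, so the family of tests as a whole keeps a 5% false-positive rate. A tiny sketch (the p-values are made up):

```python
def bonferroni(p_values, alpha=0.05):
    """Return which hypotheses survive a Bonferroni correction.

    Each p-value is compared against alpha / m, where m is the
    number of tests, holding the family-wise error rate at alpha.
    """
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]

# Ten tests: only the very small p-value survives 0.05 / 10 = 0.005.
p_vals = [0.001, 0.02, 0.04, 0.3, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
print(bonferroni(p_vals))
# → [True, False, False, False, False, False, False, False, False, False]
```

The cost is power: the more hypotheses you test, the stronger the evidence each one needs, which is the trade-off between false discoveries and missed real effects mentioned in the microarray comment above.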


This is nothing compared to the apparent perfection of pathology labs. Every blood test I have ever had has given me an exact answer with no 95% confidence interval or standard error.

Vinh Nguyen

I would have to disagree with the post and the referred article.

There is no "5% chance that an observed pattern occurred at random" — that is a MISINTERPRETATION. Please talk to a real statistician to get the interpretations right. The 5% error level is saying: if in reality there is NO treatment effect, then the probability (under this scenario) that we falsely claim there IS a treatment effect is 5%, assuming the test is done at the 5% level. 5% is just a standard and can be brought down. I do, however, agree that if you do MULTIPLE TESTING/COMPARISONS, you inflate this Type I error.

If a study is well planned (see, e.g., confirmatory phase III trials seeking FDA approval), investigators define the population of interest and define the PRIMARY question of interest. They answer that primary question using standard yet robust statistical methods, guarding against a false positive with a 0.05 or 0.025 error rate. The primary question of interest is then answered.

The above examples, where people do multiple tests and fish for "significant" results, are examples of bad science, and you will never see that with the FDA. The FDA probably holds the highest standard of integrity when it comes to statistical analyses.



The real problem is not the large amount of data, nor (mainly) the probabilistic approach, nor the large number of studies. The problem is cherry-picking *which* studies are published and talked about.

In the most extreme case, a pharma company could keep doing repeated studies about a new drug, 29 of which show it to have no effect, until the 30th has several patients get better by pure chance - and then publish only that one study and bury the rest. There are strong indications that that kind of thing is actually happening.

And even when negative studies are published, you can get pretty far by talking only about those with the desired result - the homeopathy industry is pretty good at that.
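The repeated-trials strategy in the pharma example above is easy to quantify: if the drug truly does nothing, the chance that at least one of a batch of independent trials comes out "significant" at the 5% level grows quickly with the number of trials. A quick check (the trial counts are chosen for illustration):

```python
alpha = 0.05  # per-trial false-positive rate under the null
for n_trials in (1, 5, 10, 30):
    # probability that at least one of n_trials null trials 'succeeds'
    p_fluke = 1 - (1 - alpha) ** n_trials
    print(f"{n_trials:2d} null trials: {p_fluke:.0%} chance of at least one 'win'")
# the 30-trial scenario produces a publishable fluke about 79% of the time
```

So burying 29 negative studies and publishing the 30th isn't even a long shot — under these assumptions, the odds favor it.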