Can You Trust Census Data?

No. At least that’s the conclusion of an important new paper (ungated version here) by Trent Alexander, Michael Davern and Betsey Stevenson, who find enormous errors in some critically important economic datasets.

Let’s start with the 2000 Decennial Census. Your responses to the Census were used for two purposes. First, the Census Bureau tallied up every response to produce its official population counts. And second, it produced a 1-in-20 sub-sample of these responses, which it made available for analysis by researchers. Just about every economist I know has used this Census sub-sample, as have a fair number of demographers, sociologists, political scientists, and private-sector market researchers.

The errors are documented in a stunningly straightforward manner. The authors compare the official census count (based on the tallying up of all Census forms) with their own calculations, based on the sub-sample released for researchers (the “public use micro sample,” available through IPUMS). If all is well, then the authors’ estimates should be very close to 100% of the official population count. But they aren’t:

[Census chart]
Source: Inaccurate Age and Sex Data in the Census PUMS Files: Evidence and Implications, by Trent Alexander, Michael Davern and Betsey Stevenson

The two estimates are pretty similar for those younger than 65. But then things go haywire, with the alternative estimates disagreeing by as much as 15%. In fact, the microdata suggest that there are more very old men than very old women — I know some senior women who wish this were true! The Census Bureau has confirmed that the problem isn’t with the authors’ calculations. Rather, the problem is in the public-use microdata sample.
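
For readers who want to see the shape of this check, here is a minimal sketch of the comparison described above: add up the sample weights in each age-sex cell of the microdata and express the total as a percentage of the official count. The record layout, the function name `percent_of_official`, and all of the numbers are made up for illustration; this is not the authors' code.

```python
from collections import defaultdict

def percent_of_official(records, official_counts):
    """Weighted microdata estimate as a percentage of the official count, per cell.

    records: iterable of (age, sex, weight) tuples (hypothetical layout).
    official_counts: dict mapping (age, sex) to the full-count tally.
    """
    estimates = defaultdict(float)
    for age, sex, weight in records:
        # Each sampled person stands in for `weight` people in the population.
        estimates[(age, sex)] += weight
    return {
        cell: 100.0 * estimates.get(cell, 0.0) / count
        for cell, count in official_counts.items()
        if count > 0
    }

# Made-up numbers: a 1-in-20 sample, so every record carries a weight of 20.
sample = [(64, "M", 20.0)] * 5 + [(85, "M", 20.0)] * 6
official = {(64, "M"): 100, (85, "M"): 100}
ratios = percent_of_official(sample, official)
```

In this toy example the 64-year-old cell comes out at 100% of the official count, while the 85-year-old cell comes out at 120%, which is the kind of gap the chart shows for older ages.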

What’s the source of the problem? The Census Bureau purposely messes with the microdata a little, to protect the identity of each individual. For instance, if they recode a 37-year-old expat Aussie living in Philadelphia as a 36-year-old, then it’s harder for you to look me up in the microdata, which protects my privacy. In order to make sure the data still give accurate estimates, it is important that they also recode a 36-year-old with similar characteristics as being 37. This gives you the gist of some of their “disclosure avoidance procedures.” While it may all sound a bit odd, if these procedures are done properly, the data will yield accurate estimates, while also protecting my identity. So far, so good.
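
To make the idea concrete, here is a toy sketch of a balanced swap. It is emphatically not the Bureau's actual procedure; the grouping rule, the field names, and the function `swap_ages` are all assumptions chosen for illustration. Ages are shuffled only among records that share the same sex and city, so any individual record may carry someone else's age, yet the overall count of people at each age is untouched.

```python
import random
from collections import Counter

def swap_ages(records, rng):
    """Return a copy of `records` with ages shuffled among similar records.

    "Similar" here means same sex and city (an illustrative assumption).
    Because ages are only exchanged, never invented, the number of people
    at each age is exactly preserved.
    """
    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault((rec["sex"], rec["city"]), []).append(i)

    out = [dict(rec) for rec in records]  # copy so the input is not modified
    for indices in groups.values():
        ages = [records[i]["age"] for i in indices]
        rng.shuffle(ages)  # a 37 may become a 36 only if a 36 becomes a 37
        for i, age in zip(indices, ages):
            out[i]["age"] = age
    return out

people = [
    {"age": 36, "sex": "M", "city": "Philadelphia"},
    {"age": 37, "sex": "M", "city": "Philadelphia"},
    {"age": 37, "sex": "M", "city": "Philadelphia"},
    {"age": 70, "sex": "F", "city": "Philadelphia"},
]
masked = swap_ages(people, random.Random(0))
```

Counting ages before and after (for example with `Counter(p["age"] for p in people)`) gives identical tallies. A one-directional recode, where 37-year-olds become 36 without the matching reverse change, would not have this property.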

But the problem arose because of a programming error in how the Census Bureau ran these procedures. The right response is obvious: fix the programs, and publish corrected data. Unfortunately, the Census Bureau has refused to correct the data.

The problem also runs a bit deeper. If the mistake were just the one shown in the above graph, it would be easy to simply re-scale the estimates so that there are no longer too many, say, 85-year-old men — just weight them down a bit. But it turns out that the same coding error also messes up the correlation between age and employment, or age and marital status (and, the authors suspect, possibly other correlations as well). When you break several correlations like this, there’s no easy statistical fix.
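
A toy example shows why re-weighting is not enough. The numbers, field names, and both functions below are invented for illustration. Suppose the true population has 100 people aged 64, half of them employed, and 100 people aged 85, none employed, and suppose a coding error relabels 20 of the 64-year-olds (10 of them employed) as 85. Re-weighting restores the head counts, but the relabeled workers are still sitting in the wrong age group.

```python
def reweight_to_official(records, official_counts):
    """Scale weights so each age's weighted total matches the official count."""
    totals = {}
    for r in records:
        totals[r["age"]] = totals.get(r["age"], 0.0) + r["weight"]
    return [
        dict(r, weight=r["weight"] * official_counts[r["age"]] / totals[r["age"]])
        for r in records
    ]

def employment_rate(records, age):
    """Weighted share of people at `age` who are employed."""
    total = sum(r["weight"] for r in records if r["age"] == age)
    employed = sum(r["weight"] for r in records if r["age"] == age and r["employed"])
    return employed / total

def person(age, employed):
    return {"age": age, "employed": employed, "weight": 1.0}

# Corrupted microdata: 20 former 64-year-olds now appear as 85-year-olds.
corrupted = (
    [person(64, True) for _ in range(40)]
    + [person(64, False) for _ in range(40)]
    + [person(85, True) for _ in range(10)]
    + [person(85, False) for _ in range(110)]
)
fixed_counts = reweight_to_official(corrupted, {64: 100, 85: 100})
rate_85 = employment_rate(fixed_counts, 85)
```

After re-weighting, each age group again adds up to 100 people, but the employment rate among 85-year-olds is still 10 in 120, about 8%, rather than the true 0%. Scaling every weight within an age group by the same factor cannot change a ratio computed inside that group.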

Worse still, the researchers find that related problems afflict the microdata released for other major data sources. All told, they’ve found similar errors in:

  • The 2000 Decennial Census.
  • The American Community Survey, which is the annual “mini-census” (errors exist in 2003-2006, but not 2001-02, or 2007-08).
  • The Current Population Survey, which generates our main labor force statistics (errors exist for 2004-2009).

These microdata have been used in literally thousands of studies and countless policy discussions. While the findings of many of these studies aren’t much affected by these problems, in some cases, important errors have been introduced. The biggest problems probably exist for research focusing on seniors. Yes, this means that many of those studies of important policy issues (retirement, Social Security, elder care, disability, and Medicare) will need to be revisited.

The problem is that until the Census Bureau does something about these widespread problems, we can’t even begin this process of cleaning up problematic research findings. Right now, the authors warn that: “The resulting errors in the public use data are severe, and… should not be used to study people aged 65 and over.” Given the long list of afflicted datasets, up-to-date credible research on seniors is virtually impossible.

The whole research community is waiting for the Census Bureau to do something about these problems.

UPDATE: Carl Bialik of the Wall Street Journal also digs a little deeper into these problems.

Don Sakers

I work in a public library, and the Census Bureau has been using our meeting room for a year to interview recruits, train census workers, and meet with clients.

These people are unable to keep their meeting dates, times, and locations straight. They are always showing up at the wrong branch at the wrong time, calling to cancel meetings that they never booked, or confirming meetings that are booked for a different date/time/location. Interviewees show up expecting census officials who never materialize, or census officials arrive and sit for hours waiting for clients who were given different date/time/place.

Based on my experience with the Census Bureau this time around, I have absolutely zero confidence in any numbers that they produce.


Thanks for helping to get the word out about the 65+ estimates. I know there are aged researchers out there who are unaware of the issue. But what to do? Stop using these data sets? Stop doing research? Point myself in the direction of Census headquarters and focus in an angry stare?


Don: Sounds like the Census people are recoding the microdata on their calendars!


Won't the 2010 census data be out soon, with the bug presumably fixed? Also, why did it take 10 years for this to be discovered? Hopefully more careful scrutiny will be paid to the 2010 data...



Tom Peters

Are they errors or are they indications of skulduggery?

I'm not a conspiracy theorist at heart but during the periods mentioned, some or all of the people employed in compiling the information contained in the datasets mentioned were members of a larger group that is antagonistic toward the census.

That the current census was designed and implemented by those same people is also worthy of note.


To Don Sakers:

Those Census employees you are referring to are likely temporary employees. They probably aren't the same people who are in charge of creating the public use micro sample or doing most of the on-going data analysis that goes into the final numbers.

To Justin Wolfers:

Your blog title, and some of the wording throughout, is a bit misleading. It seems there is only evidence that there are problems with the public use micro sample file. Is there evidence also that their own aggregate data are way off, too? The way this is written could lead someone to conclude the totals are wrong (which they might be) despite the fact that it doesn't appear you've offered up any evidence about that.

Grace Meng

I work for a nonprofit that is working on creating new ways to collect and analyze sensitive data, through something we call a datatrust, and repeatedly, people have asked me, "Will the datatrust's data be as accurate as data collected through careful statistical sampling?" Although we envision our work to complement, not replace, traditional data collection, this article makes clear that the assumption that current ways of collecting and releasing data are perfectly accurate is false. We're currently exploring new technology that deals with the privacy issues by adding noise to data, but in a way that reveals more information about what kind of noise is added, in contrast to the Census where they will not or cannot reveal what they've done to the data to "scrub" it of identifying information. More information about PINQ can be found here:


Jonny Rainbow

It's no secret that the Bush administration allowed the Census Bureau to fall into desuetude in order to prevent accurate future counts of persons most likely to vote for Democrats in key congressional districts (or perhaps everywhere). Critical new technology to be used in the 2010 Census, such as hand-held networked workstation devices, was not completely validated as late as the spring of last year. People hired as temporary Census employees often discovered that they knew more about the Census and its data products than the longer-term Census employees conducting training sessions.

It's probably true that all administrations try to tinker with the Census process somewhat, to promote partisan advantage in the allocation of congressional seats. However, Bush administration neglect of the Census process, which is Constitutionally mandated and should therefore be accorded extra measures of respect, bordered on the criminal, in my opinion.


Kevin H

Is it just because there are fewer people aged 65+, and because of that it is harder to apply "disclosure avoidance procedures" without causing problems?

Joel Johnson

Fair enough, but this post might be timed poorly. I hope people don't use this as an excuse to not fill out their census form.


As a former Census headquarters employee, I am surprised that it is only NOW that such a finding has come to light so prominently. On an individual level, I have never worked with so many people who care so passionately about data and their organization's mission. But collectively the agency suffers from mismanagement and tunnel vision. The geeky statistician worker bees who eventually become managers take such an introspective, internally focused approach to data and their little pieces of the puzzle that they frequently lose sight of the big picture issues and public relations ramifications of their decisions, nor do the managers above them appear to have the integrative capacity to help the various areas work together. The new director is bright, knowledgeable, and innovative but I think his work is cut out for him, having to change a culture that becomes more ingrained year after year, now housed in a disappointing, substandard, and expensive new office space that smacks of cut corners and contractor skimming. Staff are often rewarded for short-term heroics and the systems serve to diffuse responsibility rather than to encourage ownership of projects or problems.

At the same time, I can tell you the demographic survey process is a monstrous one, both from having to manage a field staff who, with little pay, have an absolutely treacherous job of collecting data from a reluctant public either in the ongoing surveys or the decennial census, to the extremely tedious process of editing data at headquarters. Unfortunately the technical documentation produced by Census fails to capture the enormity and complexity of the challenges in collecting data.

The author is right in that Census alters the microdata to protect confidentiality and that the process of data swapping can introduce error into the data sets. But I think there are threats to data quality beyond the data swapping -- from confusing questionnaires that do not accurately capture information, to respondent non-response, to inadequate data management, editing and imputing systems, not to mention a work culture that in some areas (but surely not all) is more focused on the scheduling of branch lunches and human resource squabbles than producing quality data.

Another very serious problem is that Census rarely receives funding these days for staff to run their own data analyses. If they did, some of the problems with the data might come to light before the public use microdata sets were released.

I hope this sunshine, particularly in light of the management problems that came to light as a result of the handheld computer debacle, inspires Census staff and management to reevaluate their work processes and work output in a way that both makes working for Census a more gratifying experience and also improves the quality of data for the user.



Seriously doubt accuracy of the census. It's been inadequate since before I was born. Big, old government boondoggle.


I presume the Census Bureau "refused" to produce new data fixing the mistake because it thinks doing so will violate the privacy of respondents. It makes sense that by filtering the data to focus on the changes between the incorrect and corrected report, one could in principle decode the resorting algorithm. It would only be fair if you explained this fact, or any other reason the Census Bureau gives, rather than make it look like they resist purely out of obstinance.


I've never trusted the census, and I think it's pointless. Yes, I know there are a multitude of uses for knowing/citing population, demographics, etc, but I'm sure the census is completely off.

In 2000, I lied like crazy on my census form -- everything from race to number of people in my home to income. It's none of the government's business! Plus, I had a friend who worked for the census one year. He was hired to go door-to-door and interview residents. He actually spent his days driving up to the buildings, guesstimating what the answers would be, filling in his forms and going home.

Mark B.

Actually, we would get better data if we just did a national probability sample. It would be cheaper and more accurate. Plus there wouldn't be any (of the same) worries about anonymity.


Mare (#15). It is actually the government's business. Reference the US Constitution: "The actual Enumeration shall be made within three Years after the first Meeting of the Congress of the United States, and within every subsequent Term of ten Years, in such Manner as they shall by Law direct." In any case your friend violated his terms of employment by not fulfilling his job duties.


oh my - these are some unfortunate comments here.
We know for a fact that the census is not "completely off", because we can check it with other data sources.

Also, while some people bring up good points and do realize what the actual issue is, others don't even seem to realize that the method of the paper is to compare actual census data (summary tables) with the results from the anonymized (public use) data. So the paper actually assumes that the data gathering etc. is going well.

And for the people who think a couple of anecdotes of people lying on census forms shows the entire thing is off - as they like to say - the plural of anecdote is not data...

Larry G.

#17: "Reference the US Constitution: "The actual Enumeration shall be made ... within every subsequent Term of ten Years..."."

Personally, I refuse to fill out the forms, and when the volunteer arrives at my house, I provide ONLY the number of voting-age people living at my house. If threatened, I provide false additional information.

The government does NOT have a Constitutional right to any additional information, beyond number of voters!

#18: The plural of anecdote is "corrupt data".


@19 - The Constitution calls for an enumeration of all persons in the United States, not just those of voting-age.