A new Reuters/C-SPAN/Zogby Poll has **Mitt Romney** ahead of **John McCain** by 37 percent to 34 percent in a poll of 1185 likely Republican voters in California (2.9 percent margin of error). But what is the probability that more likely voters in the state actually support Romney? Given the 2.9 percent margin of error, it’s possible that Romney just got lucky and the pollsters happened to ask an unrepresentative group that disproportionately favored Mitt.

It turns out that it is really easy to use the raw information of the poll (the leader percent, the follower percent, and the size of the poll) to calculate the probability of leading in the population. In winner-take-all elections (which are not the case for many of the primaries), this “probability of leading” is crucially what we should care about — because if people don’t change their minds (and, if undecided, break evenly), this is the probability that the poll leader will win the election. But most people have a very hard time making the calculation in their head.

So take a shot: what do you think is the probability that Romney is leading McCain in the population of likely Republican California voters?

Turns out that Romney’s probability of leading is a whopping 92.7 percent. If you want to calculate your own leader probability, I’ve created a simple Excel spreadsheet where you can plug in the numbers and generate an answer for any poll you want.
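The post doesn’t reproduce the spreadsheet’s formulas, but the standard textbook approach treats the two poll shares as multinomial proportions and applies a normal approximation to their estimated difference. Below is a minimal Python sketch of that calculation; the function name and the exact variance treatment are my assumptions, not necessarily what the spreadsheet does, so its output may differ a bit from the 92.7 percent quoted above.

```python
from math import erf, sqrt

def prob_of_leading(leader_pct, follower_pct, n):
    """Normal-approximation estimate of P(leader's true support > follower's).

    leader_pct / follower_pct are poll shares in percent; n is sample size.
    """
    p1, p2 = leader_pct / 100.0, follower_pct / 100.0
    diff = p1 - p2
    # Variance of the difference of two multinomial shares; the (p1 - p2)^2
    # term accounts for the negative covariance between the two shares.
    se = sqrt((p1 + p2 - diff ** 2) / n)
    z = diff / se
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF at z

print("Romney vs. McCain:", round(prob_of_leading(37, 34, 1185), 3))
print("Obama vs. Clinton:", round(prob_of_leading(45, 41, 1185), 3))
```

In words: the leader probability is Φ((p₁ − p₂)/SE) with SE = √((p₁ + p₂ − (p₁ − p₂)²)/n). Note the Democratic sample size is assumed here to equal the Republican one, since only the identical 2.9 percent margin of error is reported.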

The same poll found that **Barack Obama** led **Hillary Clinton** in California “by 45 percent to 41 percent, with a margin of error of 2.9 percentage points.” The same analysis suggests that, at the time of the poll, there was a 94.2 percent chance that more likely Democratic voters supported Obama than Clinton.

Of course, these probabilities may end up being wildly off — either because the poll was poorly done, or because people change their minds. But another advantage of calculating the probability-of-leading statistic is that it builds a better bridge to the prediction markets. Just after the poll was announced, Intrade had Romney’s probability of winning California at 94 percent (pretty close to the 92.7 percent leader probability). But Obama’s Intrade contract for California was trading at only 59.9 percent — substantially below his leader probability of 94.2 percent.

MSNBC reports that “Clinton held statistically insignificant 1-point leads on Obama in New Jersey and Missouri, well within the margin of error of 3.4 percentage points in both surveys.” But instead of saying that these races are statistical dead heats, it might be more useful to report that Clinton’s probability of leading is 63.6 percent in New Jersey (Intrade comparison: 60 percent) and 63.5 percent in Missouri (Intrade comparison: 67 percent).

The margin of error and the sample size tell the general public very little. How many people even know whether the margin of error represents one or two standard deviations? The probability of leading is much more intuitive, easy to calculate, and gives the public something much closer to the result they actually care about: the probability that the leading candidate will win the election.

This is the best post in recent history on this blog. Far better than some personal ramblings that I have read recently. However, could you post the formulas in the Excel spreadsheet? Also, how many standard deviations does the margin of error represent? (I think two.)

Similar to this article:

http://www.washingtonmonthly.com/archives/individual/2004_08/004536.php

What’s the formula for multiple candidates?

The calculation of the likelihood that a given candidate is in the lead assumes that the error comes just from random sampling of the “true” population of voters. Very roughly, for percentages within, say, one standard deviation, this is likely to be an OK assumption. But as the difference gets larger, any deviation from random sampling of the “true” population of voters will increasingly skew your estimate of how likely a given candidate is to be in the lead.

Another way to put this: at the tails of a distribution, even small errors in your model become important.

Add in systematic biases (liberals use cell phones, people lying about voting for black candidates), and your calculation will be even more flaky.

I think this number, while superficially attractive, will be more confusing for most people in the long run.

Jamie

Have you guys seen Keith Olbermann’s COUNTDOWN show and the “Keith number” he coined? It’s the margin of error plus the percentage of undecideds. It makes as much sense as anything else at the polls, but it’s a good indicator as to just how useful a poll is. Last week he reported a poll where the Keith number CAME IN SECOND.

You make an interesting point, and your stats are right of course. But I would take issue with which set of information is more helpful to voters. Perhaps your metric is helpful in very close races where the lead is close to the margin of error, but in races where the lead is substantial (more than 4x the standard deviation), your metric will just return 99.9% or 100% every time, depending on rounding. I would imagine that this information is pretty useless when seats are assigned on a pro-rata basis, or if you wanted to project this state’s polling onto another state. For example, think about polling of Giuliani in Florida. The polling said he had a 0% chance of winning, but was it a “close” 0% or a “far” 0%? That matters with respect to whether he would likely continue the race…

The problem with any statistical calculation like this is that it assumes that the poll is unbiased. If we knew for certain that those polled were a representative sample not just of the population in general, but of the voters who actually vote, and we knew no one changed their mind, then we could be 95% confident when a result falls outside the margin of error. In reality, with most polls like this, there are many other sources of uncertainty, including picking an incorrect sample, missing people who are not home or don’t answer their phone, predicting turnout incorrectly, or missing how undecideds will break. The reality is that most polls probably understate the uncertainty in their sampling methodologies. If Ayres is correct, then roughly 94 percent of similar polls should show Obama winning, which is definitely not the case.

What Ayres is arguing here is that there is a 95% probability that an identical poll would have Obama winning. It’s a great argument, for example, that if you sampled ballots after they come in, you should be able to figure out who won with a small sample. But it’s not a great argument for trusting polls too much. It’s a little too academic.

The “probability of leading” makes no sense. Things with random outcomes have probabilities, but “who is the current leader” is not a random event. Putting aside numerous confounding questions, including whether the respondents will actually vote, how delegates get awarded, and so on, if we could take a snapshot of likely voters’ minds at this moment, somebody would be in the lead — there is nothing random about the result. If we could take this magic psychic snapshot repeatedly in a short period of time, we would get the same answer every time.

What is random is the sampling done in the poll so it is possible to assign a probability to the outcome of a particular poll. Putting aside even more confounding questions like whether the sampling techniques used when conducting the polls truly yield a random sample, the way to interpret the figures above is as follows:

*If Clinton is in fact favored by more voters*, then the probability that this poll would yield this result (Obama leading 45%-41%) is just 5.8% (100% – 94.2%). Or in other words, one can be fairly confident that Obama is indeed in the lead since otherwise, something rather unusual would have had to occur for the poll to produce this result.

It would be nice to state what the MoE actually means. By convention, there is a 95% chance that the actual number is within the projected value +/- MoE (the MoE itself already being roughly two standard errors).