What’s the Probability That Romney Is Leading in California? A Guest Post

A new Reuters/C-SPAN/Zogby Poll has Mitt Romney ahead of John McCain by 37 percent to 34 percent in a poll of 1185 likely Republican voters in California (2.9 percent margin of error). But what is the probability that more likely voters in the state actually support Romney? Given the 2.9 percent margin of error, it’s possible that Romney just got lucky and the pollsters happened to ask an unrepresentative group that disproportionately favored Mitt.

It turns out that it is really easy to use the raw information of the poll (the leader’s percent, the follower’s percent, and the size of the poll) to calculate the probability of leading in the population. In winner-take-all elections (which many of the primaries are not), this “probability of leading” is precisely what we should care about — because if people don’t change their minds (and, if undecided, break evenly), this is the probability that the poll leader will win the election. But most people have a very hard time making the calculation in their head.

So take a shot: what do you think is the probability that Romney is leading McCain in the population of likely Republican California voters?

Turns out that Romney’s probability of leading is a whopping 92.7 percent. If you want to calculate your own leader probability, I’ve created a simple Excel spreadsheet where you can plug in the numbers and generate an answer for any poll you want.
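For readers without Excel, here is a sketch of one reasonable way to make the calculation in Python. The spreadsheet's exact formula isn't shown, so this version is an assumption: it treats the two candidates' shares as multinomial proportions and uses a normal approximation to the chance that the leader's true support exceeds the follower's. On the Romney numbers it gives roughly 89 percent rather than 92.7 percent; as the comments below discuss, the answer depends on how the standard error of the difference is handled.

```python
import math

def prob_of_leading(leader_pct, follower_pct, n):
    """Normal-approximation probability that the poll leader is truly ahead.

    Treats the two shares as multinomial proportions, so
    Var(p1 - p2) = (p1(1-p1) + p2(1-p2) + 2*p1*p2) / n.
    """
    p1, p2 = leader_pct / 100.0, follower_pct / 100.0
    se_diff = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2) / n)
    z = (p1 - p2) / se_diff
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Romney 37% vs. McCain 34%, n = 1185 likely voters
print(prob_of_leading(37, 34, 1185))
```

A tied poll returns exactly 0.5, and a larger sample or a wider lead pushes the probability toward 1, which matches the intuition the post is after.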

The same poll found that Barack Obama led Hillary Clinton in California “by 45 percent to 41 percent, with a margin of error of 2.9 percentage points.” The same analysis suggests that, at the time of the poll, there was a 94.2 percent chance that more likely Democratic voters supported Obama than Clinton.

Of course, these probabilities may end up being wildly off — either because the poll was poorly done, or because people change their minds. But another advantage of calculating the probable leader statistic is that it builds a better bridge to the prediction markets. Just after the poll was announced, Intrade had Romney’s probability of winning in California at 94 percent (pretty close to the 92.7 percent leader probability). But Obama’s Intrade contract for California was trading at only 59.9 percent — substantially below his leader probability of 94.2 percent.

MSNBC reports that “Clinton held statistically insignificant 1-point leads on Obama in New Jersey and Missouri, well within the margin of error of 3.4 percentage points in both surveys.” But instead of saying that these races are statistical dead heats, it might be more useful to report that Clinton’s probability of leading is 63.6 percent in New Jersey (Intrade comparison: 60 percent) and 63.5 percent in Missouri (Intrade comparison: 67 percent).

The margin of error and the sample size tell the general public very little. How many people even know whether the margin of error represents one or two standard deviations? The probability of leading is much more intuitive, easy to calculate, and gives the public something much closer to the result they actually care about: the probability that the leading candidate will win the election.


Andrew, in my understanding, the random event taking place is actually, "Will this random sample predict the actual leader?" And this is equivalent to the probability Ian mentions. But as many people have mentioned, this assumes a completely accurate and unbiased sample in which the only deviation comes from random sampling, so I think it's a bad basis for an InTrade figure.

In this specific instance, there is a major flaw with predicting the outcome of the California primary. According to http://www.slate.com/id/2175496/, 20-35% of the state had already voted before today. Many of them will have voted at a time when Hillary had a huge lead in the polls. So looking at where the polls are now is not a good indicator of how the primary will turn out.


The polls have been so far off that they have proved meaningless this election cycle.

I'd be interested to see a study of how many "plans" so carefully presented and debated are implemented as presented to voters. Right now we've got Hillary debating Barack on the fine details of his healthcare plan and who is covered under hers and not in his. They debate these as if either plan has a chance of being passed as they presented it.

It reminds me of the promise the Democrats made to end this war in the 2006 election, even though they had no power to do it (with the obvious exception of cutting off funding and abandoning our troops, which is not realistic).


Have you guys seen Keith Olbermann's COUNTDOWN show and the "Keith number" he coined? It's the margin of error plus the percentage of undecideds. It makes as much sense as anything else at the polls, but it's a good indicator as to just how useful a poll is. Last week he reported a poll where the Keith number CAME IN SECOND.

James Pringle

The calculation of the likelihood that a given candidate is in the lead assumes that the error comes just from randomly sampling the "true" population of voters. Very roughly, for percentages within, say, one standard deviation, this is likely to be an OK assumption. But as the difference gets larger, any deviation from random sampling of the "true" population of voters will increasingly skew your estimate of how likely a given candidate is to be in the lead.

Another way to put this is that at the tails of a distribution, even small errors in your model will be important.

Add in systematic biases (liberals use cell phones, people lying about voting for black candidates), and your calculation will be even more flaky.

I think this number, while superficially attractive, will be more confusing for most people in the long run.





What's the formula for multiple candidates?


Andrew writes that 'The "probability of leading" makes no sense.' But that actually depends. There are two schools of statistics: "Frequentist" and "Bayesian". Andrew gives a good summary of how the Frequentist school defines probability. However, the Bayesian approach is equally valid mathematically and often provides more intuitive answers. Bayesian statisticians define "the probability of X" as, roughly, how a rational gambler would bet, given certain information. It certainly *does* make sense to ask what kind of odds you should give of Obama winning, given certain polling results.


Could you please make explicit the calculation being made? To say that there is such a thing as a probability of leading, you must have an ex-ante probability on the result, and then update it with the result of the poll. With the poll alone, you can have a result and a margin of error without making an ex-ante probability assumption.

Maybe you took the uniform distribution as the prior, but that is a strong assumption.

Nick Barrowman

The calculation goes like this: (1) consider just the polling results for the top two contenders; (2) using the estimate of the leader's support and its standard error, perform a one-sided z-test of the hypothesis that the proportion is 50% versus the alternative hypothesis that it is greater than 50%; (3) let p be the one-sided p-value of this test; (4) interpret 1-p as the "probability of leading".

As has been pointed out, step 4 is not frequentist. But no prior distribution has been declared. In fact, as Julien supposes, the implicit prior distribution is (approximately) uniform over [0,1], which is sometimes called "uninformative".

Whether or not this is a "good" choice is debatable. When it comes to prediction markets, I'm not so sure.
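Nick's four steps can be written out directly. This is a sketch under the assumptions he describes (restrict to the top two candidates, then run a one-sided normal test of the leader's two-candidate share against 50%); the effective sample size for the two-candidate race is an assumption on my part, taken as the fraction of respondents naming either candidate.

```python
import math

def leader_probability(leader_pct, follower_pct, n):
    """Steps (1)-(4): one-sided z-test of the leader's two-candidate share vs. 50%."""
    # (1) keep only the top two; rescale to the two-candidate race
    two_way = leader_pct / (leader_pct + follower_pct)
    n_two = n * (leader_pct + follower_pct) / 100.0  # respondents naming either one
    # (2) z-test of H0: share = 50% against H1: share > 50%
    se = math.sqrt(two_way * (1 - two_way) / n_two)
    z = (two_way - 0.5) / se
    # (3)-(4) one minus the one-sided p-value, i.e. Phi(z)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Romney 37% vs. McCain 34%, n = 1185
print(leader_probability(37, 34, 1185))
```

On the Romney numbers this yields roughly 89 percent, a bit below the post's 92.7 percent, consistent with the standard-error debate elsewhere in the comments.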


The "probability of leading" makes no sense. Things with random outcomes have probabilities, but "who is the current leader" is not a random event. Putting aside numerous confounding questions including whether or not the respondents will actually vote, how delegates get awarded, and so on, if we could take a snapshot of likely voters' minds at this moment, somebody would be in the lead -- there is nothing random about the result. If we could take this magic psychic snapshot repeatedly in a short period of time, we would get the same answer every time.

What is random is the sampling done in the poll so it is possible to assign a probability to the outcome of a particular poll. Putting aside even more confounding questions like whether the sampling techniques used when conducting the polls truly yield a random sample, the way to interpret the figures above is as follows:

*If Clinton is in fact favored by more voters*, then the probability that this poll would yield this result (Obama leading 45%-41%) is just 5.8% (100% - 94.2%). Or in other words, one can be fairly confident that Obama is indeed in the lead since otherwise, something rather unusual would have had to occur for the poll to produce this result.



If Romney is ahead by 3%, and the margin of error is 2.9%, shouldn't the probability be in the mid-80s instead of 92.7%?

That is: Romney is leading by about 1 SD. The chance that a normal observation is

Herbert Moore

This is the best post in recent history on this blog. Far better than some personal ramblings that I have read recently. However, could you post the formulas in the Excel spreadsheet? Also, how many standard deviations does the margin of error represent? (I think two.)


My post yesterday was cut off (presumably due to a "less than" sign being interpreted as HTML). Trying again.

If Romney is ahead by 3%, and the margin of error is 2.9%, shouldn't the probability be in the mid-80s instead of 92.7%?

That is: Romney is leading by about 1 SD. The chance that a normal observation is less than 1 SD is about 84% or something, no?

If the estimates for Romney and McCain were independent, then you'd multiply the 1 SD by the square root of 2, giving 1.5 SD, which DOES work out to about 93%. But they are not independent. In fact, their correlation is minus 1, since whatever votes are not Romney's are McCain's.

Shouldn't the correct probability be 84% and not 93%?

Nick Barrowman

Phil, You're off on a few points.

First, the margin of error represents approximately 2 standard errors (SEs), not 1, so the SE is about 1.5%. Next, if you only consider Romney and McCain, Romney is ahead by more like 4%. As you note, the SE of the difference between the estimated support for Romney and McCain cannot be computed by multiplying by the square root of 2. In fact, the correct multiplicative factor is 2. (Because their correlation is indeed minus 1, subtracting them doubles the SE.) So Romney is ahead by 4%/(2 × 1.5%) ≈ 1.33 standard errors, which gives a probability of about 91%.
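Nick's arithmetic can be checked in a few lines. This sketch uses his rounded figures (SE of 1.5% from halving the 2.9% margin of error, and a 4-point two-way lead):

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

se = 0.03 / 2        # margin of error is ~2 standard errors, so SE ~ 1.5%
lead = 0.04          # Romney's lead restricted to the two-candidate race
z = lead / (2 * se)  # correlation of -1 doubles the SE of the difference
print(round(phi(z), 2))  # 0.91 -- Nick's ~91%
```

Using the independence assumption instead (dividing by sqrt(2)·se rather than 2·se) reproduces the ~93% figure that Phil questioned.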


You make an interesting point, and your stats are right, of course. But I would take issue with which set of information is more helpful to voters. Perhaps your metric is helpful in very close races where the lead is close to the margin of error, but in races where the lead is substantial (more than 4 standard deviations), your metric will just return 99.9% or 100% every time, depending on rounding. I would imagine that this information is pretty useless when seats are assigned on a pro-rata basis, or if you wanted to project this state's polling onto another state. For example, think about polling of Giuliani in Florida. The polling said he had a 0% chance of winning, but was it a "close" 0% or a "far" 0%? That matters with respect to whether he would be likely to continue the race...


The problem with any statistical calculation like this is that it assumes that the poll is unbiased. If we knew for certain that those polled were a representative sample not just of the population in general, but of the voters who actually vote, and we knew no one changed their mind, then we could be 95% confident when a result comes outside of the margin of error. In reality, with most polls like this, there are many other areas for uncertainty, including picking an incorrect sample, missing people who are not home or don't answer their phone, predicting turnout incorrectly, or missing how undecideds will break. The reality is that most polls probably understate their uncertainty in their sampling methodologies. If Ayres is correct then 95% of polls should show Obama winning, which is definitely not the case.

What Ayres is arguing here is that there is a 95% probability that an identical poll would have Obama winning. It's a great argument, for example, that if you sampled ballots after they come in, you should be able to figure out who won with a small sample. But it's not a great argument for trusting polls too much. It's a little too academic.



It would be nice to state what the MoE actually means. IIRC, by definition there is a 95% chance that the actual number is within the projected value +/- the MoE (the MoE itself already represents about two standard errors).


As mentioned in the post, for winner-take-all states the probability of leading is more important than the particular margin found in the poll. But none of the Democratic primaries are winner-take-all, so the margin is the far more useful number for those.


If you are going to ignore undecideds, then it seems you should reduce the sample size accordingly. For example, if 10% of people in a poll of 100 respond undecided, then isn't the sample size really 90, at least for calculations of the standard error?
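The effect this commenter describes is easy to quantify: shrinking the effective sample from 100 to 90 inflates the standard error by a factor of sqrt(100/90), about 5%. A sketch with the commenter's illustrative numbers:

```python
import math

n, undecided = 100, 0.10  # poll of 100 with 10% undecided
p = 0.5                   # a 50% share, for illustration

se_full = math.sqrt(p * (1 - p) / n)                    # SE using all respondents
se_decided = math.sqrt(p * (1 - p) / (n * (1 - undecided)))  # SE using only decideds
print(round(se_decided / se_full, 3))  # sqrt(100/90) ~ 1.054
```

So dropping undecideds from the denominator widens the standard error only modestly here, but the effect grows as the undecided share grows.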


Of course, with the benefit of hindsight (well, perhaps it's midst-sight, since they have just been called by the Times), it looks like California went to Clinton and McCain, respectively. So much for the 92.7% (Romney) and 94.2% (Obama) chances of leading.


McCain ended up winning every district, I think.