How to Be Less Terrible at Predicting the Future

How did typical Americans with no foreign-policy expertise come to make remarkably accurate predictions for U.S. intelligence officials? Not with Magic 8 balls. (photo: frankieleon)

Our latest Freakonomics Radio episode is called “How to Be Less Terrible at Predicting the Future.” (You can subscribe to the podcast at iTunes or elsewhere, get the RSS feed, or listen via the media player above.)

Experts and pundits are notoriously bad at forecasting, in part because they aren’t punished for bad predictions. Also, they tend to be deeply unscientific. The psychologist Philip Tetlock is finally turning prediction into a science — and now even you could become a superforecaster.

Below is a transcript of the episode, modified for your reading pleasure. For more information on the people and ideas in the episode, see the links at the bottom of this post. And you’ll find credits for the music in the episode noted within the transcript.

*     *     *

[MUSIC: Pat Andrews, “Whoa”]

There’s a website called Fantasy Football Nerd. It aggregates predictions from roughly 40 NFL pundits to produce what it calls “the industry’s most accurate consensus rankings.” Now, how accurate is the consensus? Let me give you an example: Earlier this season, the Carolina Panthers were playing the Seattle Seahawks. Only two of the pundits picked Carolina to win; 36 picked Seattle. And you could see why. Seattle has been one of the best teams in the league for the past several seasons; they won the Super Bowl two years ago, nearly repeated last year. They’d be playing Carolina in Seattle, where the home crowd is famously — almost punishingly — supportive. So even though Seattle had won only two games this season against three losses, and even though Carolina was an undefeated 4-0 at this point, the experts liked Seattle. They liked their pedigree. But Carolina won the game, 27-23.

MICK MIXON (courtesy of Carolina Panthers Radio Network): It’s the hook and ladder. Lockette has it. Lockette is being tackled. Flips it. Ball’s loose. Recovered by Seattle at the 40. Carolina has won the football game! What an unbelievable, validating, respect-taking road win for the Carolina Panthers!

Soon afterward, Carolina quarterback Cam Newton faced the media.

REPORTER: Cam, before the Seattle game, a lot of the national media was down on this team. After you guys won that game, now a lot of the national media says, “This is one of the best teams we’ve seen this year.” Do you ever find it comical, the way that a lot of these people think, “Hey, this team is on, this team is not good”?

CAM NEWTON: I find all media comical at times. Because I think in your guys’ profession, you can easily take back what you say, and you don’t get — there’s no danger, you know, when somebody says it. You know, if there was a pay cut or if there was an incentive, if picking teams each and every week, you may get a raise, I guarantee people would be watching what they say then.

So, first of all, let’s give Cam Newton a medal. Because he just articulated, in about 10 seconds, a big problem that experts in many fields — along with TV producers and opinion-page editors and government officials — either fail to understand or acknowledge, which is this: when you don’t have skin in the game, and you aren’t held accountable for your predictions, you can say pretty much whatever you want.

JONATHAN BALES: I completely agree with Cam on that.

[MUSIC: 40 Watt Hype, “Three and Out” (from Grand Unification Theory)]

That’s Jonathan Bales.

BALES: A lot of the beat writers in the NFL or across sports, they just can say what they want and there is no incentive for them to be correct. And I do think that for the most part they are very bad at making predictions.

Bales can’t afford to be bad, because he plays fantasy sports for a living.

BALES: People who have something to lose from their opinions or the predictions that they make, are incentivized to make sure that they’re right.

Bales is 30 years old. He lives in Philadelphia. He’s written a series of books called Fantasy Football for Smart People. In college, he was a philosophy major, but he also loved to analyze sports.

BALES: Yeah, I was really interested in in-game strategy.  So, why are coaches doing all these things that, even anecdotally, they just seem very wrong.

Many of the best fantasy-sports players, he says, have a similar mindset.

BALES: We question things, and we want to improve, and we ask “why?” a lot.  Like, “Why am I making lineups this way? Is this truly the best way?”  Just always questioning everything that we do, taking a very, very data-driven approach to fantasy and adapting and evolving.

Adapting and evolving. Using data to make better decisions. Challenging the conventional wisdom. That all doesn’t sound so hard, does it? Wouldn’t you think that all experts everywhere would do the same? Or, at the very least, wouldn’t you think that we would pay better attention to all the bad predictions out there — the political and economic and even sports predictions — and then do something about it? Why isn’t that happening?

PHILIP TETLOCK: That is indeed the $64,000 question: Why very smart people have been content to have so little accountability for accuracy in forecasting.

Today on Freakonomics Radio: let’s fix that! And while we’re at it, why don’t we all learn to become not just good forecasters but … superforecasters!

[MUSIC: Tim Besamusca, “Wars Between The Stars Theme”]

*     *     *

[MUSIC: Sarah Schachner, “AM Stinger” ]

If you’re a longtime listener of this program, you’ve met Philip Tetlock before.

TETLOCK: I’m a professor at the University of Pennsylvania, cross-appointed in Wharton and in the School of Arts and Sciences.

We spoke with Tetlock years ago, for an episode called “The Folly of Prediction.”

TETLOCK: I think the most important takeaway would be that the experts think they know more than they do; they were systematically overconfident.

Which is to say that a lot of the experts that we encounter, in the media and elsewhere, aren’t very good at making forecasts. Not much better, in fact, than a monkey with a dart board.

TETLOCK: Oh, the monkey with a dartboard comparison — that comes back to haunt me all the time.

[MUSIC: Danny Massure, “Mama Didn’t Lie” (from What It Is)]

Back then, I asked Tetlock to name the distinguishing characteristic of a bad, and overconfident, forecaster.

TETLOCK: Dogmatism.

DUBNER: It can be summed up that easily?

TETLOCK: I think so. I think an unwillingness to change one’s mind in a reasonably timely way in response to new evidence. A tendency, when asked to explain one’s predictions, to generate only reasons that favor your preferred prediction and not to generate reasons opposed to it.

Tetlock knows this because he conducted a remarkable, long-term empirical study, focused on geopolitical predictions, with nearly 300 participants.

TETLOCK: They were very sophisticated political observers. Virtually all of them had some postgraduate education. Roughly two-thirds of them had Ph.D.s. They were largely political scientists, but there were some economists and a  variety of other professionals as well.

This study became the basis of a book that Tetlock titled Expert Political Judgment. It was a sly title because the experts’ predictions often weren’t very expert. Which, to Philip Tetlock, is a big problem. Because forecasting is everywhere.

TETLOCK: People often don’t recognize how pervasive forecasting is in their lives — that they’re doing forecasting every time they make a decision about whether to take a job or whom to marry or whether to take a mortgage or move to another city. We make those decisions based on implicit or explicit expectations about how the future will unfold.  We spend a lot of money on these forecasts. We base important decisions on these forecasts. And we very rarely think about measuring the accuracy of the forecasts.

Some of us may have been satisfied to merely identify and describe this problem, as Tetlock did. Some of us might have gone a bit further and raised our voices against the problem. But Tetlock went even further than that. He put together a team to participate in one of the biggest forecasting tournaments ever conducted. It was run by a government agency called IARPA.

TETLOCK: IARPA is Intelligence Advanced Research Projects Activity. And it is modeled somewhat on DARPA. It aspires to fund cutting-edge research that will produce surprising results that have the potential to revolutionize intelligence analysis.

And Tetlock was at the center of this cutting-edge research. He tells the story in a new book, called Superforecasting, co-authored by the journalist Dan Gardner. The book is both a how-to, if at a rather high level, and a cautionary tale, about all the flaws that lead so many people to make so many bad forecasts: dogmatism, as we mentioned earlier; a lack of understanding of probability; and a reliance on what Tetlock calls “vague verbiage.”

DUBNER: In the book you mention a couple cases from history where the intelligence community did not do so well. The Bay of Pigs situation with JFK and then later the belief that Saddam Hussein had weapons of mass destruction. In both instances you write that it wasn’t about bad intelligence, it was about how the intelligence was communicated to government officials and to the public. So, what happened in those cases?

TETLOCK: Well, in the context of the Bay of Pigs, the Kennedy administration had just come into power and they were considering whether to support an effort, by Cuban exiles and CIA operatives and others, to launch an invasion to depose Castro in April ’61. And the Kennedy administration asked the Joint Chiefs of Staff to do an independent review of the plan and offer an assessment of how likely this plan was to succeed. And I believe the vague-verbiage phrase that the Joint Chiefs analysts used was they thought there was a “fair chance of success.” And it was later discovered that by “fair chance of success” they meant about one in three. But the Kennedy administration did not interpret “fair chance” as being one in three. They thought it was considerably higher. So, it’s an interesting question of whether they would have been willing to support that invasion if they thought the probability were as low as one in three.

DUBNER: As a psychologist, though, you know a lot about how we are predisposed toward interpreting data in a way that confirms our bias or our priors or the decision we want to make, right? So, if I am inclined toward action and I see the words “fair chance of success,” even if attached to that is the probability of 33 percent, I might still interpret it as a move to go forward, yes?

TETLOCK: Absolutely. That’s one of the ways in which vague-verbiage forecasts can be so mischievous. It’s very easy to hear in them what we want to hear. Whereas I think there’s less room for distortion if you say “one-in-three” or “two-in-three” chance. It’s a big difference between a one in three chance of success and a two in three chance of success.

DUBNER: A difference of one, if I’m doing my math properly.

TETLOCK: Right.

DUBNER: Now, the Bay of Pigs didn’t really change much in the intelligence community, you write. Surprisingly perhaps. But the WMD issue with Saddam Hussein in Iraq was an embarrassment to the point that the government wanted to do something about it. Is that about right, that IARPA was founded in part out of response to that?

TETLOCK: I’m not sure I understand all of the internal decisions inside the intelligence community but I think that the false-positive judgment on weapons of mass destruction in Iraq did cause a lot of soul-searching inside the U.S. intelligence community and made people more receptive to the creation of something like IARPA, yes.

IARPA was formed in 2006. One of its major goals is, and I quote, “anticipating surprise.”

[MUSIC: Dot Dot Dot, “Standing On Top of the World”]

TETLOCK: I think that’s why they decided to fund these forecasting tournaments.

These forecasting tournaments would deal with real issues.

TETLOCK: They all had to be relevant to national security, according to the intelligence community.  

DUBNER: For instance?

TETLOCK: So, whether Greece would leave the Eurozone was considered to be an event of national-security relevance.

Some other questions:

MARY SIMPSON: Whether the Muslim Brotherhood was going to win the elections in Egypt.

BILL FLACK: Would the president of Austria remain in office?

These are a couple of the forecasters on Tetlock’s team.

SIMPSON: Will Russia’s credit rating decline in the next eight weeks?

FLACK: There was the notorious China Sea question about whether there would be a violent confrontation around the South China Sea.

TETLOCK: We were one of five university-based research programs that were competing. And the goal was to generate the most accurate possible probability estimates.

DUBNER: What was IARPA trying to accomplish? Were they trying to really crowdsource intelligence? Were they trying to figure out how government intelligence could improve itself? Or what?

TETLOCK: Well, I think crowdsourcing and improvement of probabilistic accuracy they saw as deeply complementary goals.

DUBNER: OK.

TETLOCK: They set up the performance objectives in 2011, very much in the wisdom-of-the-crowd tradition. The idea being that the average forecast derived from a group of forecasters is typically more accurate than the majority, often the vast majority, of forecasters from whom the average was derived. So they wanted to see whether or not we could do 20 percent better than the average, 30 percent, 40 percent, 50 percent as the tournament went on.
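Tetlock’s wisdom-of-the-crowd benchmark is easy to simulate. Below is a minimal Python sketch (all numbers are invented for illustration): each forecaster reports a noisy probability for each event, accuracy is measured with the Brier score (the squared error between a probability forecast and a 0/1 outcome, lower is better), and the unweighted crowd average typically beats the large majority of the individual forecasters it was derived from.

```python
import random

random.seed(42)

def brier(forecast, outcome):
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (forecast - outcome) ** 2

# 100 questions; each event occurs with a hidden true probability p.
questions = [random.random() for _ in range(100)]
outcomes = [1 if random.random() < p else 0 for p in questions]

# 50 forecasters each report p plus personal zero-mean noise,
# clipped to stay inside [0, 1].
n_forecasters = 50
forecasts = [
    [min(1.0, max(0.0, p + random.gauss(0, 0.2))) for p in questions]
    for _ in range(n_forecasters)
]

# Mean Brier score of each individual forecaster...
individual_scores = [
    sum(brier(f, o) for f, o in zip(fs, outcomes)) / len(outcomes)
    for fs in forecasts
]

# ...versus the unweighted average of all forecasts per question.
crowd = [
    sum(fs[q] for fs in forecasts) / n_forecasters
    for q in range(len(questions))
]
crowd_score = sum(brier(f, o) for f, o in zip(crowd, outcomes)) / len(outcomes)

beaten = sum(s > crowd_score for s in individual_scores)
print(f"crowd Brier score: {crowd_score:.3f}; beats {beaten} of {n_forecasters} forecasters")
```

Averaging cancels each forecaster’s idiosyncratic noise, which is why the crowd’s score is usually lower than almost every individual’s, and why IARPA set “beat the unweighted average” as the bar to clear.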

DUBNER: OK, so what did you name your team?

TETLOCK: The Good Judgment Project.

It was an optimistic name, if nothing else. The team was put together by Tetlock; his research and life partner Barbara Mellers, who also teaches at Wharton; and Don Moore, from the Haas business school at Berkeley. But here’s the thing: you didn’t have to be an academic, or an expert of any kind, to join the Good Judgment Project or any of the other teams in the IARPA tournament. Anyone could sign up online – and tens of thousands of people did, eager to make forecasts about global events.

TETLOCK: Each of the research programs had its own distinctive philosophy and approach to generating accurate probability judgments. I think we were probably the most eclectic and opportunistic of the research programs and I think that helped. And…

DUBNER: Eclectic and opportunistic how? What do you mean by that?

TETLOCK: Well, I think we were ready to roam across disciplines fairly freely. We just didn’t care that much about whether we offended particular academic constituencies by exploring particular hypotheses. So we got a lot of pushback on a lot of the things we considered. There was a big debate, for example, about whether it would be a good idea to have forecasters work in teams. And we didn’t really know what the right answer was. There were some good arguments for using teams. There were some good arguments against using teams. But what we did is we ran an experiment. And it turned out that using teams, in this sort of context, helped quite a bit. There was also a debate about whether it would be feasible to give people training to help reduce some common psychological biases in human cognition and again we didn’t know for sure what the answer would be but we ran experiments and we found out that it was possible to get a surprising degree of improvement by training people, giving people tutorials that warned them against particular biases and offered them some reasoning strategies for improving their accuracy. So, we did a lot of things that some psychologists or other people in the social sciences might have disagreed with, and we went with the experimental results.

[MUSIC: Nicole Reynolds, “When We Meet Again” (from This Arduous Alchemy)]

DUBNER: Give me now some summary stats on the Good Judgment Project’s performance overall. First of all, how long did the tournament end up lasting, Phil?

TETLOCK: The tournament lasted for four years.

DUBNER: OK. How many questions did IARPA pose?  

TETLOCK: Roughly 500 questions were posed between 2011 and 2015, inclusive.

DUBNER: And your team, the Good Judgment Project, gathered approximately how many individual judgments about the future?

TETLOCK: Let’s see: thousands of forecasters, hundreds of questions, forecasters often making more than one judgment per question because they have opportunities to update their beliefs. I believe it was in excess of one million.

DUBNER: OK. And how’d you do?

TETLOCK: Well, we managed to beat IARPA’s performance objectives in the first year. IARPA’s fourth-year objective was doing 50 percent better than the unweighted average of the crowd, and our best forecasters and best algorithms were out-performing that even after year one. And they continued to out-perform in years two, three and four. And the Good Judgment Project was the only project that consistently outperformed IARPA’s year-one and year-two objectives, so IARPA decided to merge teams, essentially. So the Good Judgment Project was able to absorb some really great talent from the other forecasting teams. And each year, at the end of the year, we creamed off the top two percent of forecasters and we called them superforecasters. So the top two percent of roughly 3,000 forecasters would be about, what, 60 people or so. And the next year and the next year and on it would go.

DUBNER: So, the way you’re describing the success of the Good Judgment Project now in your kind of measured academic tone of voice sounds pretty measured and academic. But let’s be real, you kicked butt, yes?

TETLOCK: Yep. Fair enough.  

DUBNER: And what did IARPA do, or how did they respond to the success of your team —in addition to, I assume, “Congratulations,” did they want to, I don’t know, hire a bunch of your superforecasters, or you?

TETLOCK: I have heard people in the intelligence community express an interest in potentially hiring some superforecasters. I don’t know whether they have or not. Our superforecasters tend to be gainfully employed. But some of them might have been interested in that.

*     *     *

[MUSIC: Justin Dodge, “Dextrous” ]

After several years of overseeing the Good Judgment Project — and, now, its commercial spinoff, Good Judgment Inc. — Philip Tetlock has come to two main conclusions. The first one: “foresight is real.” That’s how he puts it in his book, Superforecasting. The other conclusion has to do with what sets any one forecaster above the crowd. “It’s not really who they are,” Tetlock writes. “It is what they do. Foresight isn’t a mysterious gift bestowed at birth. It is the product of particular ways of thinking, of gathering information, of updating beliefs. These habits of thought can be learned and cultivated by any intelligent, thoughtful, determined person.”

DUBNER: OK, so you ran this amazing competition, a long series of experiments, in which you identified these people who were better than the rest at predicting, in this case, mostly geo-political events. And what we really want to know is – again, as nice as that is, congratulations Dr. Tetlock, etc. etc. — we want to know what are the characteristics of the superforecasters. Because we all want to become a little bit more of one. So, would you mind walking us through some of these characteristics, Phil? Let’s start with — what about their philosophical outlook? A superforecaster tends to be what, philosophically would you say?

[MUSIC: Tim Besamusca, “Wars Between The Stars Theme”]

TETLOCK: They’re less likely than ordinary people, regular mortals, to believe in fate, or destiny. And they’re more likely to believe in chance.  You roll enough dice enough times and improbable coincidences will occur. Our lives are nothing but a quite improbable series of coincidences. Many people find that a somewhat demoralizing philosophy of life. They prefer to think that their lives have deeper meaning. They don’t like to think that the person to whom they’re married, they could have just as easily have wound up happy with 237,000 other people.

DUBNER: What about their level of, let’s say, confidence or even arrogance. Is a superforecaster arrogant?

TETLOCK: I think they’re often proud of what they’ve accomplished, but I think they’re really very humble about their judgments. They know that they’re just often very close to forecasting disaster. They need to be very careful. I think it’s very difficult to remain a superforecaster for very long in an arrogant state of mind.

DUBNER: So would you say that humility is a characteristic that contributes to superforecasting then or do you think it just kind of travels along with it?

TETLOCK: I think humility is an integral part of being a superforecaster, but that doesn’t mean superforecasters are chickens who hang around the maybe zone and never say anything more than minor shades of maybe. You don’t win a forecasting tournament by saying maybe all the time. You win a forecasting tournament by taking well-considered bets.  

DUBNER: OK, so let’s talk about now their abilities and thinking styles. A superforecaster will tend to think in what styles?

TETLOCK: They tend to be more actively open-minded. They tend to treat their beliefs not as sacred possessions to be guarded but rather as testable hypotheses to be discarded when the evidence mounts against them. That’s another way in which they differ from many people. They try not to have too many ideological sacred cows. They’re willing to move fairly quickly in response to changing circumstances.

DUBNER: What about numeracy? Background in math and/or science and/or engineering? Is that helpful, important?

TETLOCK: They’re not — there are a few mathematicians and statisticians among the superforecasters, but I wouldn’t say that most superforecasters know a lot of deep math. I think they are pretty good with numbers. They’re pretty comfortable with numbers. And they’re pretty comfortable with the idea that they can quantify states of uncertainty along a scale from 0 to 1.0, or 0 to 100 percent. So they’re comfortable with that.  Superforecasters tend to be more granular in their appraisals of uncertainty.

DUBNER: And what about  the method of forecasting? Can you talk a little bit about methods that seem to contribute to superforecasters’ success?

TETLOCK: One of the more distinctive differences between how superforecasters approach a problem and how regular forecasters approach it is that superforecasters are much more likely to use what Danny Kahneman calls the outside view, rather than the inside view. So, if I asked you a question about whether a particular sub-Saharan dictator is likely to survive in power for another year, a regular forecaster might get to the job by looking up facts about that particular dictator in that particular country, whereas the superforecasters might be more likely to sit back and say, “Hmm, well, how likely are sub-Saharan dictators who’ve been in power x years likely to survive another year?” And the answer for that particular question tends to be very high. It’s in the area of 85, 95 percent, depending on the exact numbers at stake. And that means their initial judgment will be based on the base rate of similar occurrences in the world. They’ll start off with that and then they will gradually adjust in response to idiosyncratic inside-view circumstances. So, knowing nothing about the African dictator or the country even, let’s say I’ve never heard of this dictator, I’ve never heard of this country, and I just look at the base rate and I say, “hmm, looks like about 87 percent.” That would be my initial hunch estimate. Then the question is, “What do I do?” Well, then I start to learn something about the country and the dictator. And if I learn that the dictator in question is 91 years old and has advanced prostate cancer, I should adjust my probability. And if I learn that there are riots in the capital city and there are hints of military coups in the offing, I should again adjust my probability. But starting with the base-rate probability is a good way to at least ensure that you’re going to be in the plausibility ballpark initially.
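Tetlock’s procedure of starting with the base rate and then adjusting for inside-view evidence maps naturally onto a Bayesian update in odds form. Here is a minimal Python sketch: the base rate echoes his hypothetical 87 percent, but the likelihood ratios attached to each piece of evidence are invented for illustration.

```python
def to_odds(p):
    """Convert a probability to odds in favor."""
    return p / (1 - p)

def to_prob(odds):
    """Convert odds in favor back to a probability."""
    return odds / (1 + odds)

# Outside view: Tetlock's hypothetical base rate for a long-tenured
# dictator surviving another year.
base_rate = 0.87
odds = to_odds(base_rate)

# Inside view: each piece of evidence enters as a likelihood ratio,
# P(evidence | survives) / P(evidence | deposed). The ratios here
# are invented for illustration; values below 1 cut the odds.
evidence = {
    "dictator is 91 with advanced cancer": 0.4,
    "riots in the capital": 0.6,
    "hints of a military coup": 0.5,
}

for description, likelihood_ratio in evidence.items():
    odds *= likelihood_ratio
    print(f"after '{description}': {to_prob(odds):.2f}")
```

Multiplying the likelihood ratios in one at a time is the quantitative version of “gradually adjusting” away from the base rate: the estimate starts in the plausibility ballpark and each piece of inside-view evidence moves it.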

DUBNER: What about the work ethic of a superforecaster? How would you characterize that?

TETLOCK: You don’t win forecasting tournaments by being lazy or apathetic. You have to be willing to do some legwork and learn something about that particular sub-Saharan country. It’s a good opportunity to learn something about a strange place and a strange political system. It helps to be curious. It helps to have a little bit of spare time to be able to do that. So that I guess implies a certain level of socioeconomic status and flexibility.

DUBNER: And what about I.Q.?

TETLOCK:  I think it’s fair to say that it helps a lot to be of somewhat above-average intelligence if you want to become a superforecaster. It also helps a lot to know more about politics than most people do. I would say they’re almost necessary conditions for doing well. But they’re not sufficient, because there are plenty of people who are very smart and close-minded. There are plenty of people who are very smart and think that it’s impossible to attach probabilities to unique events. There are plenty of reasons why very smart people don’t ever become superforecasters and plenty of reasons why people who know a ton about politics never become superforecasters.

It is very hard to become a superforecaster, Tetlock makes clear, unless you have a very good grip on probability.

TETLOCK: We talk in the book with a great poker player, Aaron Brown, who’s the chief risk officer of AQR.

AQR is an investment and asset-management firm in Greenwich, Conn.

TETLOCK: He defines the difference between a world-class poker player and a talented amateur as: the world-class player knows the difference between a 60/40 proposition or a 40/60 proposition. Then he pauses and says, “No, more like 55/45, 45/55.” And of course you can get even more granular than that in principle. Now, when you make that claim in the context of poker, most people nod and say, “Sure, that sounds right,” because in poker you’re sampling from a well-defined universe. You have repeated play. You have clear feedback. It’s a textbook case where the probability theory we learned in basic statistics seems to apply. But if you ask people, “What’s the likelihood of a violent Sino-Japanese clash in the East China Sea in the next 12 months?” Or another outbreak of bird flu somewhere? Or whether Putin is up to more mischief in Ukraine, or whether Greece might begin flirting with the idea of exiting the Eurozone? If you ask those types of questions, most people say, “How could you possibly assign probabilities to what seem to be unique historical events?” There just doesn’t seem to be any way to do that. The best we can really do is use vague verbiage, make vague-verbiage forecasts. We can say things like, “Well, this might happen. This could happen. This may happen.” And to say something could happen isn’t to say a lot. We could be struck by an asteroid in the next 24 hours and vaporized: 0.000001 percent. Or the sun could rise tomorrow: 0.99999. So “could” doesn’t tell us a lot. And it’s impossible to learn to make better probability judgments if you conceal those probability judgments under the cloak of vague verbiage.

*     *     *

[MUSIC: Junebug, “Wish It Away” (from Junebug)]

DUBNER: Let me ask you this: if you were asked to introduce one question into an upcoming presidential debate, let’s say, that you feel would give some insight, via the candidates’ answers, into their overall views on forecasting — our limits and the need for it — what kind of question would you try to ask?

TETLOCK: What a wonderful question that is. You’ve taken me aback, it’s such a good question. I’m going to have to think hard about that. I don’t have an answer right off the top of my head, but I would love to have the opportunity to draft such a question. It would be something along the lines of: Would it be a good thing for the advisors to the President to make an effort to express uncertainty in numerical terms and to keep a record of how accurate or inaccurate they are over time? Would you like to have presidential daily briefings in which, instead of the documents saying this “could” or “might” or “may” happen, it says, “Our best analysts, when we crowdsource, put the probability somewhere between .35 and .6”? That’s still a pretty big zone of uncertainty, but it sure is a lot better than “could,” which could mean anything from 0.01 to 0.99.

DUBNER: Now, can you imagine anyone saying they wouldn’t want that, though? Do you think there are those who would want to show they’re so, whatever, macho that, “No, no, no, no we don’t want to traffic in that.”

TETLOCK: I think there’s vast variation among politicians in how numerate they are and how open they are to thinking of their beliefs as gradations along an uncertainty continuum rather than expressions of tribal loyalties. We have the story in the book about President Obama making the decision about going after Osama Bin Laden and the probability estimates he got about Osama’s location, and how he dealt with those probabilities. The probabilities ranged from about, I don’t know, maybe from about .4 to about .95, with a center of gravity around .75. And the President’s reaction was to shrug and say, “Well, I don’t know what to do with this. It feels like a 50/50 thing, like a coin toss.” Now, that’s an understandable reaction from a president who is about to make an important decision and feels he’s getting somewhat conflicting advice and feels like he doesn’t have closure on a problem. It’s a common way to use the language. But it’s not how the President would have used the language if he’d been sitting in a TV room in the White House with buddies watching March Madness and Duke University is playing and someone says, “What’s the likelihood of Duke winning this game?” and his friends offer probabilities ranging from 0.5 to about 0.95 with a center of gravity of 0.75 once again. He wouldn’t say, “Sounds like 50/50.” He’d say, “That sounds like three to one.” Now, how much better decisions would politicians make if they achieved that improvement in granularity, accuracy, calibration? We don’t know. I think that if the intelligence community had been more diffident about its ability to assign probability estimates, the term “slam dunk” probably wouldn’t have materialized in the discourse about weapons of mass destruction in Iraq. I think the actual documents themselves would have been written in a more circumspect fashion. I think there were good reasons for thinking Saddam Hussein was doing something suspicious. I’m not saying that the probability would have been less than 50 percent. The probability might have been as high as 85 percent or 80 percent, but it wouldn’t have been 100 percent.

DUBNER: But I wonder how much of this is our fault — “our” meaning the public. Because you know when someone makes a decision that turns out poorly, not wrong necessarily but poorly, even if the odds were very much in his or her favor, we punish them for the way that turned out. I mean, forget about politics, go to something as silly as football. If a head football coach goes for it on fourth down when all the probability is encouraging him to do so, and his team doesn’t make it, we know what happens. All the sports fans come out and say, “This guy was an idiot. What the hell was he doing? He didn’t properly calculate the risk.” Whereas in fact he calculated the risk exactly right and maybe there was an 80 percent probability of success and he happened to hit the 20 percent. So, we don’t respond well to probabilistic choices. And maybe that’s why our leaders don’t abide by them.

TETLOCK: That’s right. I mean, part of the obstacle is in us. We’ve met the enemy and the enemy is us. We don’t understand how probability works very well. We have a very hard time taking the outside view toward the forecast we make and the forecast other people make. And if we did get in the habit of keeping score more we might gradually become a little more literate.
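The fourth-down scenario Dubner describes is easy to simulate: judged one play at a time, an 80 percent gamble fails often enough to invite second-guessing, yet over many attempts it pays off at roughly its stated rate. A quick sketch, with the 80 percent figure and the trial count chosen purely for illustration:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def go_for_it(p_success, attempts=10_000):
    """Fraction of attempts that succeed when each has probability p_success."""
    return sum(random.random() < p_success for _ in range(attempts)) / attempts

rate = go_for_it(0.8)
# rate lands near 0.80: the 20 percent of failures are inevitable and highly
# visible, but they do not make the underlying decision wrong.
```

The coach who takes the 80 percent play every time comes out ahead over a season, even though any single failure looks, in hindsight, like a blunder.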

*     *     *

So who are these people – these probability-understanding, humble, open-minded, outside-view people – who have the power of superforecasting?

BILL FLACK: Until I got into grad school I was used to being the smartest person in the room. And grad school very quickly disabused me of that notion.

[MUSIC: Mizimo, “The Path of Least Resistance” (from The Path of Least Resistance)]

That’s one of them. His name is Bill Flack. He’s a 56-year-old retiree in rural Nebraska. And he is a superforecaster with the Good Judgment Project — one of the top two percent. Flack studied physics in college and got a master’s in math. And even though he wanted to get his Ph.D. …

FLACK: I just came to realize that I didn’t have either the mental power or the commitment to the subject to pursue a Ph.D.

As smart as he is, Flack admits he is not very worldly.

FLACK: I often don’t read the newspaper at all, and when I do it’s generally the Omaha World-Herald, which isn’t remarkable for its foreign-policy coverage.

Flack wound up working for the U.S. Department of Agriculture. He was semi-retired when he first read about the Good Judgment Project.

FLACK: Basically, I thought it sounded kind of interesting, like “might be fun to try.”

MARY SIMPSON: It’s an area that’s always been interesting to me — how people make decisions.

And that is Mary Simpson, another of Tetlock’s superforecasters.

SIMPSON: I grew up in San Antonio, Texas.  And spent my first 18 years there, had a typical suburban family — older brother, younger sister, stay-at-home mom.  Dad was an engineer.  You know, the typical breadwinner.  And I went to college in Dallas at Southern Methodist University and that was the time when a lot of women were discovering that they could do things besides get married and have children. So I sort of broadened my horizons, found economics, and was really interested in it and decided I wanted to do something besides get married and have kids.  I finished a Ph.D. from Claremont Graduate School and I went to work for the big local public utility Southern California Edison as an assistant economist.

That’s where Simpson was still working when she got involved with the Good Judgment Project. It was just a few years after the financial crash, which Simpson had failed to foresee.

SIMPSON: I had totally missed the 2007-2008 financial crash.  I had seen bits and pieces. I knew that there was certainly a housing bubble. But I did not connect any of the dots to the underlying financing issues that had really created the major disruption in the financial industry, and you know, the subsequent Great Recession.   

Simpson didn’t think her forecasts for the Good Judgment Project would be much better.

SIMPSON: You know, it’s one of those things where I’m a very analytical person, always decent in math, and learned over the years how to kind of assess situations and make predictions.  On the other hand, I’m fairly skeptical of forecasting.  My company spent thousands of dollars every year for the best in the class of economic forecasts — uh, that’s what they were.  We had to forecast. We had to understand where sales would go and be able to make predictions in order to be sure that there was enough power and to assess revenue levels and cost of electricity, and so forth.  So we relied on forecasts, but they were often wrong. So, again, I was hopeful to do a decent job but also very skeptical of the ability of anyone to forecast in certain arenas, especially.

Simpson, like Bill Flack, got involved in the forecasting tournament mostly for fun.

SIMPSON: I was only working part-time and felt like I needed to keep my brain engaged.

It was a volunteer position; they weren’t being paid by the Good Judgment Project. Though they did get …

FLACK: an Amazon gift certificate.

What was it worth?

SIMPSON: A couple hundred dollars. It was not a lot.

FLACK: If you took the value of the Amazon gift certificate and divided it by the hours we put into it, we were getting something like twenty cents an hour.

[MUSIC: C-Leb the Kettle Black, “The Celebration” (from The Kettle Black)]

So here were a couple of non-experts in the realm of geopolitics being asked to make a series of geopolitical predictions.

FLACK: I didn’t have any background and had to learn it all from the start.

SIMPSON: I really had very little expertise in terms of international events.

FLACK: Pretty much every single question, I had to dig for background information.

SIMPSON: You need to understand the facts on the ground, you need to understand the players, what their motives are.

FLACK: Spent a lot of time with Google News, some time with Wikipedia, which I mostly used as a source of sources basically.

SIMPSON: You know, I have an analytical bent.  I’m interested in doing research.  

FLACK:  And, you know, pretty much had to educate myself up on the subject.

SIMPSON: A lot of it is the work.  You have to do the work, you have to update, you have to really stay engaged.  And if you simply answer the questions once and let them go and don’t look at them again, you’re not going to be a very good forecaster.

TETLOCK: One of the unusual things about how questions are asked in forecasting tournaments is that they’re asked extremely explicitly. It’s not just, “Will Greece leave the Eurozone?” There are very specific meanings to what leaving the Eurozone means and there’s a very specific timeframe within which this would need to happen.

SIMPSON: It’s not simply answering “yes” or “no” on a question. The answer had to be, “What is your expectation of this event happening?”  In other words, is it 50 percent or is it 90 percent? So, you know, there was a certain amount of effort to figure out, “Well, what’s a good probability?”

FLACK: Each of us learned from previous questions whether we were being overconfident or underconfident on specific types of questions. We were getting pretty much constant feedback; every time a question resolved we knew whether we were right or wrong, whether we’d been overconfident or underconfident.  And we tried to look back and see, on questions where we’d gone wrong, how we’d gone wrong; on questions where we’d done well, what we’d done right.  Were we lucky?  Had we followed a very good approach that we should apply to other questions?
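The feedback loop Flack describes is typically quantified with a Brier score, the rule the Good Judgment Project used to grade forecasts: the squared gap between the probability you gave and the 0-or-1 outcome. A minimal sketch, with a made-up history of resolved questions:

```python
def brier_score(forecast, outcome):
    """Squared error between a probability forecast and the 0/1 outcome.
    0.0 is perfect; always saying 50% earns 0.25; 1.0 is maximally wrong."""
    return (forecast - outcome) ** 2

# Hypothetical resolved questions: (probability given, what actually happened).
history = [(0.9, 1), (0.7, 1), (0.8, 0), (0.6, 1), (0.3, 0)]

avg = sum(brier_score(p, o) for p, o in history) / len(history)

# A crude overconfidence check, in the spirit of the feedback Flack describes:
# among events called at 70% or higher, how often did they actually happen?
confident = [(p, o) for p, o in history if p >= 0.7]
hit_rate = sum(o for _, o in confident) / len(confident)
```

Here the hypothetical forecaster called three events at 70 percent or more but only two of them happened, a hint of overconfidence worth revisiting on the next round of questions.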

And so these typical Americans with no foreign-policy experience whatsoever wound up making remarkably accurate forecasts about things like the Grexit, or whether there would be conflict in the South China Sea.

FLACK: One of the things I liked about Good Judgment was it gave me a pretext to learn about these various foreign-policy issues.

SIMPSON:  I think there’s certain satisfaction in knowing that you’re actually helping research that will hopefully lead to better assessments and better forecasts on the part of government.

FLACK: Certainly I’ve gotten a good deal less patient with the pundits who issue forecasts where, “Well, this could happen,” but don’t attempt to assign a probability to it, don’t suggest how it could go the other way.  You probably won’t like this answer but I’ve grown much less fond of radio news because in trying to make forecasts I’ve been really looking for details. And it annoys me greatly when the radio starts a story about something that could be interesting and then they go into anecdotes instead.  Public radio is as bad as the rest, I’m afraid.

[MUSIC: Danny Massure, “Mama Didn’t Lie” (from What It Is)]

Not today, friend-o! We are all about the details. For instance, here are what Philip Tetlock calls the Ten Commandments for Aspiring Superforecasters:

1: “Triage. Focus on questions where your hard work is likely to pay off.” Pretty sensible.

2: “Break seemingly intractable problems into tractable sub-problems.” OK, no problem.

3: “Strike the right balance between inside views and outside views.”

4: “Strike the right balance between under- and overreacting to the evidence.”

5: “Look for the clashing causal forces at work in each problem.” That’s where the homework comes in, apparently.

6: “Strive to distinguish as many degrees of doubt as the problem permits but no more.” OK, that one just sounds hard.

7: “Strike the right balance between under- and overconfidence, between prudence and decisiveness.”

8: “Look for the errors behind your mistakes but beware of rearview-mirror hindsight biases.” Did you get that one? Here, let me read it again: “Look for the errors behind your mistakes but beware of rearview-mirror hindsight biases.”

9: “Bring out the best in others and let others bring out the best in you.” Not a very Washington, D.C. concept, but what the heck.

10: “Master the error-balancing bicycle.” Wha? This one needs a bit more explanation:

“Just as you can’t learn to ride a bicycle by reading a physics textbook,” Tetlock writes, “you can’t become a superforecaster by reading training manuals. Learning requires doing, with good feedback that leaves no ambiguity about whether you are succeeding … or … failing.” Now, if following these commandments sounds like a lot of work – well, that’s the point.

DUBNER: What your book proves, among a lot of things that are interesting, I think the most fascinating, the most uplifting really is that this is a skill or maybe set of skills that can be acquired or improved upon, right? The people who are better than others at forecasting are not necessarily born that way. Not born that way at all, correct?

TETLOCK: I think that’s a deep truth, a deep lesson of the research that we conducted. Sometimes I’m asked, “How is it that a group of people, regular citizens who didn’t have access to classified information, working part-time, were able to generate probability estimates that were more accurate than those generated by intelligence analysts working full-time jobs and with access to classified information? How is that possible?” And I don’t think it’s because the people we recruited are more intelligent than intelligence analysts. I’m pretty sure that’s not true. I don’t think it’s even because they’re more open-minded. And it’s certainly not because they know more about politics. It’s because our forecasters, unlike many people in Washington, D.C., believe that probability estimation of messy real-world events is a skill that can be cultivated and is worth cultivating. And hence they dedicate real effort to it. But if you shrug your shoulders and say, “Look, there’s no way we can make predictions about unique historical events,” you’re never going to try.

[MUSIC: Human Factor, “You Know The Feeling” ]

Philip Tetlock has been running forecasting tournaments for roughly 30 years now. And the success of the Good Judgment Project has dictated his next move.

TETLOCK: It has led me to decide that I want to dedicate the last part of my career to improving the quality of public debate. And I see forecasting tournaments as a tool that can be used for that purpose. I believe that if partisans in debates felt that they were participating in forecasting tournaments in which their accuracy could be compared against that of their competitors, we would quite quickly observe the depolarization of many polarized political debates. People would become more circumspect, more thoughtful and I think that would on balance be a better thing for our society and for the world. So, I think there are some tangible things in which the forecasting technology can be used to improve the technology of public debate if only we were open to the possibility.

*     *     *

Coming up next week on Freakonomics Radio, it happens all the time: Some company or institution, maybe even a country, does something you don’t like. So you and maybe a few million friends of yours decide to start a boycott. This leads to a natural question: do boycotts work?

*     *     *

Freakonomics Radio is produced by WNYC Studios and Dubner Productions. This episode was produced by Arwa Gunja. Our staff also includes Jay Cowit, Merritt Jacob, Christopher Werth, Greg Rosalsky, Kasia Mychajlowycz, Alison Hockenberry, and Caroline English. Thanks to the Carolina Panthers Radio Network for providing audio for this episode. You can now hear Freakonomics Radio on public-radio stations across the U.S. If you’re one of our many international podcast listeners — well, you should probably just move here. Or at least listen to our recent episode on open borders, called “Is Migration a Basic Human Right?”

If you want more Freakonomics Radio, you can also find us on Twitter and Facebook and don’t forget to subscribe to this podcast on iTunes or wherever else you get your free, weekly podcasts.

 



David J.

I loved this podcast. In my work, predictions are made with no accountability for past bad predictions. These bad predictions can lead to millions of dollars in overruns, but they are never tied to the original predictors, only to the members who are executing the plan. I wish I could meet with all the brilliant minds that you meet with each week. Since I can't, thanks for letting us listen in! -David

Charles Mann

The subject of the podcast was interesting but I found it quite lacking in terms of concrete data.

Is there a white paper or some other source of information on what made the superforecasters better than a coin flip or dart board? What predictions or forecasts did they nail that other folks did not? Did they score better on a series of predictions? How do we know that these folks weren't simply "lucky" and might regress to the mean with their next series of predictions?

ch33p

I thought about the exact same thing. If you have 3,600 people making predictions, you are bound to have 60 people who outperform even really good forecasters.
See also managed portfolios.

RachelM

I would love to see this kind of rigour applied to Environmental and Social Impact Assessments. Particularly for social issues, it’s really challenging to understand and predict impacts but there is rarely an ex-post evaluation of how accurate the assessment was or which impacts occurred that were not predicted. I suspect damage is being caused that could easily be avoided, and that a lot of money is being spent on the wrong issues.

RoboCoach

If the host introduces his episode with a swipe at the Bush administration...how likely is the host a leftist? Pretty high.

Tim

I took it as a swipe at the inaccuracy of the intelligence community, in the post 9/11 environment. Which are a matter of public record - not some leftist conspiracy.

This same intelligence community reflected on those failings, and started these forecasting projects.

To learn and share, and become more accurate at forecasting and providing data to policy makers.

As Tetlock says, the greatest benefit of these sorts of exercises will be to increase the level of public debate, on very real issues, that have an impact on many people.

Instead of relying on Left (BBC) or Right (FOX) punditry to determine our discourse and our decisions.

I really enjoyed the episode - thanks to all involved.

James

The problem here is that these "predictions" are still nothing more than more-or-less informed guessing. There's no testable theory underlying the predictions. When NASA launches space probes that reach their destinations years and hundreds of millions of miles later (and know where to point their cameras, too!), do you think they go by inspired guesswork? Likewise with everyone's favorite, weather prediction. Meteorologists aren't just guessing: they have theory & calculations that not only predict, but say what the accuracy limits of the predictions are.

Larry

TLDL - Come up with a baseline (probably by doing a ton of research) and then adjust it (by doing another ton of research). Thanks Steve!

Coming up with an anecdote of experts picking incorrectly (the football game) in what was probably close to a coin flip doesn't tell us much.

Oh, and that poker quote is very wrong.

Paul W.

It's a bit disappointing that there was no discussion of survivorship bias. It sounds like the definition of it, given how the superforecasters were selected. What was the probability that you would end up with these people, given how many people participated in the project?

Kevin B

The podcast was interesting but, in my opinion, came up short. The anecdotal introduction is hugely problematic and made the same mistake many people (including bad poker players) make, which is believing the outcome of a single sample dictates the “correctness” of a probabilistic prediction (or in the poker case, the outcome of a particular hand). If there was truly a 55% chance the Seahawks would win that game then every expert should have picked them. And, even though they picked correctly, there was a 45% chance they would all be wrong. The introduction suggested that because very few people picked Carolina to win that meant they had almost no chance. That just isn’t the case. If you are given 1:1 odds (like the TV pundits, simply +1 if you are right and -1 if you are wrong) for a game that is 70/30, no one picks the 30… In this case, the one game example (today’s loss notwithstanding) means almost nothing. That isn’t to say the talking heads know what they are talking about; it simply means the example does not demonstrate that they don’t.

I would be interested in the long term success of the “super predictors” and see if they continued to have a higher level of success than their peers even after they had been identified as top performers. Considering they are predicting one-off events, and there were thousands of people participating, even if they did simply flip a coin you would expect a small handful to do quite well; although I am sure that there is a significant difference between some people in their ability to identify accurate likelihoods. So, I would like to know more about how they were scored and if the percentages they placed with the events had an effect on their scores. (This very well may have been covered, but I was dealing with a crying baby for part of the podcast…)

Overall I feel the discussion did not hammer home that the ability to predict the likelihood of an event and the ability to predict the event itself are related but rather different. I was also disappointed that risk was not discussed regarding the use of predictions for policy decisions. Just because there is a 90% chance that something will not happen doesn’t mean you don’t want to act on the 10% chance it will; it all depends on the cost of being wrong 10% of the time versus the cost of being wrong the other 90% (it is why we take bomb threats seriously even though the odds of an actual bomb tend to be low — in this case, if you are going to be wrong, you’d prefer to err on the side of caution, even if it isn’t an “error” when probability and risk are taken into account).

(As a side note: the poker analogy, as someone else pointed out, is terrible. As described, that is the difference between an amateur who loses quickly and an amateur who probably still loses but more slowly…)


Kevin B

My previous comments were a little harsh, so I just wanted to add that I am glad they brought up the topic; it is an interesting one. Obviously I felt there were shortcomings, but it was well worth the time to listen, and it made me rethink a couple of things as well. I am looking forward to the next podcast!

TimBoyer

Great episode, but left me wondering something. How are their successes measured?

For instance, if I say there's a 55% chance of a Greek default next week, and someone else says there's a 65% chance, and Greece defaults, how is her forecast measured to be better than mine? We were both correct. Wouldn't the logical thing to do be to always predict either 100% or 0%? If you're right, you're then extremely right.

I'm sure there's a methodology; I'm just curious what it is.

Charles D. dietzen

Dr. Tetlock describes forecasters assigning probabilities to possible future occurrences.
He also describes grading the accuracy for both scoring (tournaments) and feedback to improve one's forecasting ability. My question is how does one define success and failure?

In trying to assess one's accuracy, does a forecaster score a plus (success) for any event that occurs (or does not occur) when the predicted probability of that occurrence (non-occurrence) was greater than 50%, and a minus (failure) for any event that occurs (or does not occur) when the predicted probability was less than 50%?

Earl Baker

I really enjoyed this podcast. It raised some very interesting questions and introduced me to some intriguing concepts. But I would have liked a little more specificity on the numbers. For example:

1. How accurate were the SuperForecasters?

2. For the relatively unique events being evaluated, how do we know whether the 70% prediction was more accurate than the 60% prediction (or vice versa) if they both predicted correctly?

3. Are the "probability" numbers being assigned by the forecasters really probabilities? Or might they more accurately be measuring the confidence level of the forecaster? I would like to know, for example, how frequently events rated as 65% probable by the Superforecasters actually occurred? Did they occur at a 65% rate? Did the 80% events occur at an 80% rate? (I have been reading about the Brier scoring used in the project, and while I agree it probably sorts out the good forecasters from the bad, I'm not sure it is really getting to the accuracy of the probability predictions. But I am not advanced enough in my mathematics to get my arms completely around it.)

4. And finally, I was very interested in this idea that the forecasters could change their predictions multiple times as new information becomes available. Does the last of several modified predictions erase the initial prediction? Is the prediction made the day before the final deadline weighted the same as the prediction made a year earlier? It seems there may be some room for gaming the system in allowing changes to be made to the predictions.

Anyway, I did enjoy the show. It really made me think.


GustavoLuken

I'm a recent monetary contributor to Freakonomics because it just needs to continue on forever. Excellent work you guys do. I'm amazed at the questions Mr. Dubner asks when he interviews, and I was wondering if anyone has ever thought about asking the right questions in life, in our daily living. Just throwing it out there: it would help everyone save time. I'm a medical doctor, and finding the correct questions to ask my patients is crucial. Does it depend on what we are looking for? Who we ask? What time of day, or year, etc.? Or just be a very good journalist, like Mr. Dubner.