Episode Transcript
As an academic, I did a lot of research trying to catch bad actors. Everyone from cheating teachers, to terrorists, to sumo wrestlers who were throwing matches. What I didn’t do much of, though, was to try to catch cheating academics. My guest today, Uri Simonsohn, is a behavioral science professor who’s transforming psychology by identifying shoddy and fraudulent research. Watching his exploits makes me wish I’d spent more time doing the same.
SIMONSOHN: We talk about red flags versus smoking guns. So red flags, that gives you, like, probable cause, so to speak. But it’s not enough to raise an accusation of fraud. That’s a smoking gun.
Welcome to People I (Mostly) Admire, with Steve Levitt.
* * *
Uri Simonsohn and two other academics, Joe Simmons and Leif Nelson, run a blog called Data Colada, where they debunk fraud, call out cheaters, and identify misleading research practices. Uri, on his own, has been doing this work for over a decade. My Freakonomics friend and co-author, Stephen Dubner, spoke to the Data Colada team for a series about academic fraud that ran on Freakonomics Radio in 2024. But I admire Uri and his collaborators so much that I wanted the chance to talk to him myself. I started the conversation by asking about the research study that got him started in this direction. The study that he read and said to himself, “My God, this is outrageous. I just can’t take it anymore!”
SIMONSOHN: The first time I ever did any sort of commentary or criticism, I was asked to review a paper for a journal. What they were studying was the impact of your name on your big life decisions. The paper began with something like, It’s been shown that people with the same initial are more likely to marry each other than expected by chance. In this paper we checked whether they have similar divorce rates — or something like that. And the idea was, Oh, maybe if you marry somebody just because they have your same initial, you’re less compatible than if you follow more traditional motivation for marriage. And I thought, No way. And so I stopped reviewing and I went to the original paper. And it’s not that I thought, There’s no way that your name impacts these decisions. I could imagine some mechanisms where like you’re talking to somebody, and then they happen to share a name with you. And that can be a nice icebreaker, and it would lead to a relationship. But what I thought is, How in the world do you study this credibly? And so that led me to an obsession. I went to the original and I thought, Okay, clearly it’s going to be some ethnic confound, for example. ‘Cause different ethnicities have different distributions of last names, right? So very few South American people have a last name starting with a W, but many Asian people do. And so because Asians marry Asians and South Americans marry South Americans, that could explain it. But it was better than that, the original study. So it took me a while and then I figured out what the problem was. The very first one I checked was with same initial last name. So the idea is that if your last name starts with an S, you’re more likely to marry somebody else whose name starts with an S. And what I found was that the effect was driven entirely by people who have the exact same last name. So I thought, Why would people have the same last name and be more likely to marry? Like, How would that happen? And a common mechanism is that there’s a couple, they get married, she changes her last name to his, they divorce, and then they remarry each other. Okay? Now, this is rare. This is rare, but because you expect so few people to marry somebody else with the same last name, it’s such a huge coincidence, that even a small share of these people, they can generate an average effect that’s sizable.
LEVITT: Oh, yeah. That’s great. That’s clever. So later you and Joe Simmons and Leif Nelson published a paper in 2011 called “False-Positive Psychology,” and it turned out to be an incredibly influential paper. You and your co-authors highlight specific practices commonly done by researchers that can lead to drawing the wrong conclusions from a randomized experiment. The core of the paper is really made up of simulations that show quantitatively how various researcher choices lead to exaggerated levels of statistical significance. So it appears in a published paper that a hypothesis is true even when it really isn’t. What was the motivation behind writing that paper?
SIMONSOHN: Me, Leif, and Joe we were going to conferences and we were not believing what we were seeing, and we were sticking with our priors. And so then what’s the point? You should read a paper that surprises you, and you should update. It doesn’t mean you should believe with certainty it’s true. But you should update, you should be more likely to believe it’s true than you did before reading the paper. And we were not experiencing that.
LEVITT: It is one of the few academic papers that has caused me to actually laugh out loud because as part of that paper you describe in a very serious way, an actual randomized experiment you, yourself, ran in which you find that listening to the song “When I’m Sixty-Four” by The Beatles actually causes time to reverse. People who listen to that song, they’re almost 1.5 years younger after they listen to the song than before. And obviously that makes no sense at all, which is the whole point. But you report the results in the same scientific tone that pervades all academic papers, and I found it to be hysterically funny. So let me start by giving my best attempt to describe the textbook version of a randomized experiment that’s the gold standard of scientific inquiry. So here’s my attempt: the researcher starts by posing a clear hypothesis that he or she wants to test. So in your “When I’m Sixty-Four” paper, this hypothesis would be that listening to that song causes time to run in reverse, leaving people who listen to it younger after they listen to it than before. And then the researcher poses a second, alternative hypothesis called the null hypothesis to which that first hypothesis is compared. In this case, the null hypothesis would probably be that listening to “When I’m Sixty-Four” does not cause time to reverse.
SIMONSOHN: Right.
LEVITT: Then the researcher maps out a detailed experimental protocol to test these two competing hypotheses. And then using very simple high-school level statistics, you determine whether there are any statistically different changes in ages across the subjects in the two groups. And if I ran that experiment as described, you would be inclined to believe my results, whatever they were.
SIMONSOHN: That’s right.
LEVITT: Okay. So let’s talk now specifically about how you used standard methods employed in the psychology literature at the time to prove that this Beatles song reverses time. The first common practice you talk about is measuring the outcome you care about in multiple ways, but only reporting results for the outcome variable that yields the best results.
SIMONSOHN: So, all you need to do is give yourself enough chances to get lucky. We can think of P-values, the idea that something is significant — there’s like a one-in-20 chance that you can get lucky.
LEVITT: Right, so the P-value refers to the probability value, or the likelihood of observing an effect like the one you’re studying just by chance, if there’s no real effect. In academic publishing, for reasons I don’t really fully understand, we’ve anointed the 5-percent level of statistical significance as some kind of magic number, right? So if your story’s not really true, you’d only get the data that looked like this less than 5 percent of the time. Then that somehow magically leads people to say that your theory’s true. If it’s above 5 percent, then we tend to say, “Oh, you haven’t proven yet that your theory’s true.”
SIMONSOHN: So like, suppose you have a friend who says, “I can forecast basketball games,” and they get it right for the one game. You are like, “Well, it was 50-50 that you would get it right. So I’m not impressed.” So then they get two games in a row. It’s like, Oh, okay, that’s more surprising. There’s only a 25-percent chance that if you were just tossing a coin you would get two basketball game guesses correct. But you’re still not sufficiently convinced. But when they do five in a row, then you think, Oh, maybe they can actually forecast basketball games, because it’s so unlikely that just by chance you’ll get five in a row that I guess the alternative becomes my candidate theory. So I guess you can predict basketball games. That’s the logic of it. Like at some point you’re forced to get rid of just chance as the explanation.
LEVITT: Okay. So in this “When I’m Sixty-Four” experiment, what you’re saying is, it was like a friend asked you to predict the outcome of five basketball games in a row, but what you secretly did in the background is you actually predicted the outcome of five basketball games not once, but maybe a hundred times. You had a hundred different series of five basketball games and there was one of them out of those hundred that actually gave you this crazy result. And then you reported that’s what you got.
SIMONSOHN: Yeah. Like, imagine your friend said, “Okay, I’m going to predict five basketball games.” But they’re also predicting five baseball games and five football games, and five whatever games. And then whichever one worked is the one they tell you about. And the ones that didn’t work, they don’t tell you about. So if you have enough dice, even if it’s a 20-sided die, like only a one-in-20 chance, if you keep rolling that die, eventually it’s going to work out. And so the way academics or researchers in general can throw multiple dice at the same time and be able to get the significance they’re looking for is they can run different analyses. Which is what we did there. So we were comparing participants’ age across the two groups. So that’s one die, but we actually had three groups. So we had “When I’m Sixty-Four,” we had a control song, which is a song that comes with Windows as a default in the background. And we also had the “Hot Potato” song, which is a kid’s song. And so that could have had the opposite effect. So we had three: we could have compared “Hot Potato” to control, or we could have compared “Hot Potato” to “When I’m Sixty-Four,” or “When I’m Sixty-Four” to control. So right there we had three dice that we were throwing. But we also could adjust our results. So the one we ended up publishing was controlling for how old the participants’ fathers were. Okay. And the logic was, Look, there’s a lot of natural variation in how old people are. And so to be able to more easily detect differences induced by our manipulation, we want to control some of that noise, right? To take them into account. And so one way to take people’s age into account indirectly is to ask how old their parents are. And so in our regression, statistically we take that into account. And when we did that, the effect arose. And why? Because if you do that, now we have three more dice, right? We can do controlling for father’s age, “Hot Potato” versus control; controlling for father’s age, “Hot Potato” versus “When I’m Sixty-Four”; and so on. And so in the end, we had like many, many different ways we could cut the data, and the one that worked is the one we ended up writing about.
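What Simonsohn describes is easy to see in a quick simulation. Here is a minimal sketch, not from the episode or the paper, of the “extra dice” idea: three conditions with no true effect anywhere, three pairwise comparisons, and only the best result reported. The sample size, the t-test, and the use of Python with numpy and scipy are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group = 10_000, 20
hits = 0
for _ in range(n_sims):
    # Three conditions drawn from the same distribution: no true effect anywhere.
    a, b, c = (rng.normal(size=n_per_group) for _ in range(3))
    # Three pairwise comparisons are three "dice"; keep only the luckiest one.
    pvals = [stats.ttest_ind(x, y).pvalue for x, y in ((a, b), (a, c), (b, c))]
    if min(pvals) < 0.05:
        hits += 1
print(f"False-positive rate when reporting the best of 3 comparisons: {hits / n_sims:.1%}")
```

Even with just three comparisons, the chance of finding something “significant” is well above the nominal 5 percent, and every extra outcome measure, covariate, or condition adds more dice.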
LEVITT: The way you just described it, it is completely and totally obvious that you’re cheating. If you test a bunch of outcomes and then you just choose not to tell people about the ones that don’t work and you focus all your attention on the one that does work, you’re obviously misleading people into believing your hypothesis is more powerful than it really is. So how could the academics not realize this was bad science? Do you think they really didn’t understand that this was cheating?
SIMONSOHN: I do. I do because I had many conversations where people were pushing back and saying, “There’s nothing wrong with it.” I think there’s two ingredients to it. One of them is just not knowing the statistics. Most people who take statistics don’t learn statistics. For some reason, it is profoundly counterintuitive to humans. It’s just not how we think. And the other reason is, we’re very good storytellers. And so what happens is, the moment you know what works, you immediately have a story for why it makes sense. I remember the first time I presented the “When I’m Sixty-Four” study, somebody in the audience asked a question jokingly about some decision we had made. My instinct was to immediately defend it. Like, we are just so trained that’s what you do. So I don’t think people were cheating in the sense that they thought it was wrong. They just didn’t know, and they didn’t quite appreciate the consequences. I just want to say, it’s not just psychology. This is very common in clinical research. If somebody is running an experiment, it can be in medicine, in economics, in biology. Like, at the time I was talking to scientists from all fields, and this is a very widespread problem.
LEVITT: Okay, so that’s good to point out because one could easily say, “Well, I’m not that worried if psychologists are messing around.” But when medical researchers are messing around, now you’re actually getting into things people really care about. Okay. Let’s talk about the second misleading research practice that you highlight, and this one’s a lot more subtle than the one we just talked about. The researcher designs an experiment and carries it out. And then he or she looks at the data and sees that the results — Oh, they’re not quite statistically significant. Everything’s going in the right direction, but it didn’t quite reach this magic 0.05 threshold. So it seems sensible in that situation, you say, “Well, look, I just didn’t have enough power.” That’s what we call it in experimental design when you don’t have enough research subjects to actually show that your true hypothesis is really true — You don’t have enough power. And so maybe I’ll just go and add another 15 or 20 observations and I’ll see if it’s significant. Oh, and maybe, again, it wasn’t quite significant. I’ll add 20 more. Boom. I’m over the threshold, and then I stop. Now intuitively, this doesn’t seem nearly as bad to me as not reporting all the outcome variables, but as you show, my intuition is wrong. This is actually a really bad practice. Can you try to explain why in a way people can understand?
SIMONSOHN: Yeah. Let’s think of a sport, let’s say tennis. And let’s say you’re playing tennis with somebody who’s similarly skilled as you are. And so beforehand, if you had to guess who’s going to win, it’s like a 50-50 chance. But suppose we change the rules, and Steve gets to say when the game ends. We don’t play to three sets, we play to whenever you want to end it. Okay, and you’re one of the two players.
LEVITT: Okay.
SIMONSOHN: You may see that now you’re much more likely to win the match.
LEVITT: Yeah, if I win the first point, match over.
SIMONSOHN: That’s right. So if at any point during the game you are ahead, you win the game, and therefore, now the probability is not that you win after three sets, but it’s that you are ahead at any point in the game, and that necessarily has to be much more likely. And so similarly with an experiment, when we do the stats what the math is doing in the background is saying, Well, if you’re committed to 60, how likely is it that after 60 you’ll have an effect that’s strong? But what we should be asking is, What is the likelihood that at any point up to 60, your hypothesis will be ahead by a lot? That’s a question we should be asking, and that’s necessarily much more likely.
LEVITT: Yeah. And the key is that you stop when you win and you keep on going when you’re losing.
SIMONSOHN: Mm-hmm. That’s what introduces the bias. It’s not a random decision. If you were to flip a coin, do I keep collecting subjects or not? Then there will be no problem. The problem is the way you said it, like if you’re losing, you keep playing, but if you’re winning then you end.
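The stopping rule Levitt and Simonsohn are describing can also be simulated. This is a minimal sketch under assumed parameters (start with 20 subjects per group, peek after every 10 more, give up at 100), not the simulation from the paper: there is no true effect, yet stopping as soon as p dips below .05 inflates the false-positive rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, start_n, step, max_n = 10_000, 20, 10, 100
hits = 0
for _ in range(n_sims):
    a = list(rng.normal(size=start_n))
    b = list(rng.normal(size=start_n))
    while True:
        if stats.ttest_ind(a, b).pvalue < 0.05:  # "winning": stop and declare significance
            hits += 1
            break
        if len(a) >= max_n:                      # out of budget without a win
            break
        a.extend(rng.normal(size=step))          # "losing": collect more subjects, look again
        b.extend(rng.normal(size=step))
print(f"False-positive rate with peek-and-stop data collection: {hits / n_sims:.1%}")
```

The asymmetry is exactly the tennis example: you only stop when you are ahead, so being ahead at any point, rather than at a fixed end, becomes the test you are implicitly running.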
LEVITT: It’s unlike the first point where the academics I would talk to about having multiple outcomes, they totally got why that wasn’t legit. But I still can have conversations with experimentalists who will argue with me about this point and say I’m dead wrong. How can I not understand this? This is a good research practice, not a bad research practice. But as you show in the paper, and as the intuition you just described explains, it’s really a bad practice.
SIMONSOHN: We’re not good intuitive thinkers about statistics, especially about conditional probability, which this — it has that flavor. And that’s the source of the problem.
We’ll be right back with more of my conversation with behavioral scientist Uri Simonsohn after this short break.
* * *
LEVITT: What I find so beautiful about this paper is that it is really so simple. It’s so easy to understand. It’s so obvious in a way, and yet a whole field of academics was totally blind to it until you pointed it out. And at least the way I’ve heard the story told, you and your co-authors didn’t think you’d even be able to publish the paper, much less imagine that it would emerge as one of the most important papers published in psychology in the last two decades.
SIMONSOHN: We thought it was uncitable ’cause we thought, How can you cite this paper? In what context? Like you would say, “Well, we didn’t do this weird thing that they talked about,” citation. So we thought, Okay, maybe it’d be influential, but it’d be hard to publish and it’d be uncitable. And it’s incredibly cited, like crazy, thousands of times. We were super wrong.
LEVITT: The bottom line, which is really stunning I think to most people who read your paper, is that in your simulations, when you test a hypothesis that is not true by design (you’ve built these hypotheses not to be true) but you do all of these different tweaks together, then over 60 percent of the time you get statistically significant results for your hypothesis. Okay? These are hundred-percent-false hypotheses that 60 percent of the time come out looking true. That’s crazy. It surprised me. Did it surprise you how big that number was when you first ran the simulations?
SIMONSOHN: We were floored. That’s when we decided we definitely have to do the paper.
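Combining the two flexibilities sketched above, pick the best of several comparisons and keep adding subjects until something works, compounds the problem. The parameters below are illustrative assumptions, not the ones in “False-Positive Psychology,” which combined additional researcher degrees of freedom (extra outcome measures, an optional covariate, dropping conditions) to reach roughly 60 percent.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, start_n, step, max_n = 5_000, 20, 10, 60
hits = 0
for _ in range(n_sims):
    groups = [list(rng.normal(size=start_n)) for _ in range(3)]  # still no true effect
    while True:
        pvals = [stats.ttest_ind(groups[i], groups[j]).pvalue
                 for i, j in ((0, 1), (0, 2), (1, 2))]
        if min(pvals) < 0.05:        # any of the three comparisons "works": report that one
            hits += 1
            break
        if len(groups[0]) >= max_n:  # out of budget
            break
        for g in groups:             # otherwise add more subjects and try again
            g.extend(rng.normal(size=step))
print(f"False-positive rate with both flexibilities combined: {hits / n_sims:.1%}")
```

Each flexibility on its own looks harmless; stacked together they make a false hypothesis look true far more often than one time in 20.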
LEVITT: Yeah. So I’ve had a lot of psychologists as guests on this show, and they have reported some truly remarkable findings, and I suspect I should have been more skeptical of them than I was. But it’s also odd to only believe research that confirms your beliefs. It’s a hard line to follow. I guess that’s why we so desperately need credibility in research: when research is not credible, you just default to your own intuition. But if you’re just defaulting to your own intuition, you go back to Socrates and Aristotle, you’re no longer empirically driven.
SIMONSOHN: One of the things we’ve been doing is advocating for pre-registration, which means people tell you how they will analyze the data before they actually run the study. So closer to the way you were describing the ideal experiment. And there has been substantial uptake of this idea of pre-registration. So when you see the results, you can evaluate the evidence much closer to the face value of what the statistics tell you.
LEVITT: So in that initial paper you laid out a simple set of rules for how to create a body of research that’s more credible. And one of them is this pre-registration. Another really simple one is making your raw data available. I think this will amaze people who are outside of academics, but until recently, until after what you did, and in large part probably ’cause of what you did, academics were not expected to let others see their raw data. And that has really been transformational, I think, don’t you?
SIMONSOHN: Yes, that’s very important. It’s easier to check for errors. It’s easier to check for fraud if one is so inclined. And it’s easier to even just look for robustness, like the idea that, Oh yeah, you get it with this particular model, but let me try something else the way I usually analyze the data. Do I also get it that way? So that’s become much more common. I wouldn’t take too much credit for that. The internet is probably a big source of why it’s just easier to upload and share than it used to be.
LEVITT: I’m going to say something controversial now. I have the sense that part of the reason that researchers 10 or 15 years ago were behaving so unscientifically, and, still, researchers are pretty unscientific, is that at some fundamental level, nobody really cares whether the results are true or not. I get the sense that most social scientists see academic publishing as a game of sorts: the goal is to get publications and citations and tenure, and there’s an enormous number of academic papers written each year and a nearly infinite supply of academic journals, so that in the end, very low-quality stuff ends up getting published somewhere. And except for a handful of papers, there’s little or no impact on the broader world that comes out of this research. So it just isn’t so important whether results are right. But when I’ve bounced that hypothesis off of other academics, they usually get really mad at me, although I do believe there’s a lot of truth to it. What do you think?
SIMONSOHN: I think there’s truth to it. There’s definitely people who don’t care whether it’s true or not, because what they’re doing is, maybe “game” is too strong a term, but like their job is to publish papers. Their job is not to find new truth that other people can work with. At the same time, in our blog we’ve done 130 or so posts, and at least some of them are criticisms of papers. And we have a policy where we engage with the authors before we make anything public. We send them a draft and we ask them for feedback. Nobody likes it.
LEVITT: Nobody wants to hear from you. That’s a disaster.
SIMONSOHN: So excited to learn what the shortcomings of this paper were. Like, nobody’s in that mood. But beyond that, they seem to really care. Now, it could be they just don’t want to be shown to have been wrong. And there’s some truth to that, but they do seem to really care about the phenomenon. I agree with you that there’s a lot of people that don’t care, but I think the higher up you go on the ladder of more influential researchers and more influential journals, I think they do care. In fact, if anything, I think they have an inflated sense of how important their work is. Not the opposite. They think their work is really changing the world. They don’t think of it as a game. They think their life is so important ’cause they’re really changing things. And part of my sort of motivation is I don’t agree with that. I agree with you in the sense that I think most research is insufficiently valuable. Even most top research is insufficiently valuable. And maybe this is too naive of me, but I think if you make it harder to publish silly things and to publish, like, sexy findings, the only hope then is to study something that’s important. Even if it’s not intrinsically interesting, it’s going to be more important. And so to move social science to be more of a force for greater welfare in society. I don’t think we’re there. I don’t think social science is all that useful at the moment, but I do think it has the potential.
LEVITT: So we’ve been talking so far about how standard methodological practices can lead readers to falsely conclude that a hypothesis is true, because the way things are represented is misleading. Okay. That only gets you so far in the abstract. What we really need is an actual tool that one can apply to a specific body of research to reliably judge whether the findings are credible. And, damn, you and Joe and Leif, you came up with that too. It’s called the P-curve, and it is simultaneously, again, incredibly brilliant and incredibly simple. Can you give the intuition that underlies the P-curve?
SIMONSOHN: Yep. So what it looks like is this: any one study is going to have a P-value. It’s going to tell you how surprising the findings are. And remember we’re saying if something’s 0.05, it’s significant. We’re drawing this arbitrary line at 0.05, right? And so if you see one study ending at 0.04, that’s okay. That’s significant. If you see one study at 0.06, that’s not significant. But the key insight, and it’s related to stuff I had done with motivation and goals, is that if you’re aiming for a goal, you’re going to end pretty close to it. So for example, if you’re aiming to run a marathon in four hours, you are not going to run it in three-and-a-half hours. You’re going to run it at 3:58, 3:59. ‘Cause that’s your goal. The moment it’s achievable, you just stop going. And so the basic idea of P-curve is, if people are trying multiple things to get to 0.05, they’re not going to go all the way to 0.01 or to 0.001. They’re going to stop once they achieve it. So, you start and your P-value’s 0.12. You know you need a 0.05. And so you try something — you control for father’s age, right? And then that gets you to 0.08, and then you drop the “Hot Potato” condition and you end up at 0.04, and then you stop. You don’t keep going ’cause you achieved your goal. So if I see a bunch of results and they’re all 0.04s, then it becomes more likely that you are P-hacking. A lot of academics across all fields now use this term, P-hacking, which is about how you selectively report from all the analyses you did. If there’s a true effect that is significant, you expect very few 0.04s and you expect mostly 0.01s. If you read the literature, you don’t see a lot of 0.01s. You see a lot of 0.04s and 0.03s, right? Which tells you something. So if you give me 20 studies and I look at the P-values and I see 0.04, 0.03, 0.04, 0.03, I should not believe those studies. And if I see 0.01, 0.01, 0.02, I should believe them. And so P-curve just formalizes that. So it takes all the P-values, it does a magic sauce, and based on the mix of results that are close to 0.05 versus far from 0.05, it tells you, You should believe it, you should not believe it, or you need more data to know.
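A crude illustration of that intuition, and only the intuition: the published P-curve method works with the full distribution of significant p-values rather than the two-bin split and binomial check below, so treat this sketch and its thresholds as assumptions for demonstration.

```python
from scipy import stats

def crude_p_curve(p_values, alpha=0.05):
    """Split significant p-values into a 'low' bin (below alpha/2) and a 'high'
    bin (alpha/2 up to alpha). Mostly-low is what a real effect tends to produce;
    a pile-up just under alpha is the signature of p-hacking a null effect."""
    sig = [p for p in p_values if p < alpha]
    low = sum(p < alpha / 2 for p in sig)
    high = len(sig) - low
    # Under a null effect, hacked-to-significance p-values are roughly uniform
    # on (0, alpha), i.e. about a 50/50 split; how surprising is this many lows?
    p_right_skew = stats.binom.sf(low - 1, len(sig), 0.5)
    return {"low": low, "high": high, "p_right_skew": round(float(p_right_skew), 4)}

print(crude_p_curve([0.04, 0.03, 0.04, 0.045, 0.03]))   # all just under .05: looks hacked
print(crude_p_curve([0.001, 0.01, 0.02, 0.003, 0.01]))  # mostly tiny: looks like a real effect
```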
LEVITT: Okay, so given this idea of the P-curve, in practice, have you been able to debunk whole literatures by using this concept and show that things that people believed and where lots of papers were published are probably not true at all?
SIMONSOHN: I’ve only done one of that flavor, and it was controversial. I think now it’s not so controversial. It was the Power Posing literature. There was a very influential TED talk; people believed that if you assume a powerful pose, meaning you expand your body, like imagine somebody raising both of their arms and standing up, then they become more confident, more successful. They would take more risk and things like that. And we applied P-curve to that literature and we found that there was no evidence for it. Maybe 30 published papers on it? And if you look at them, they provided no evidence.
LEVITT: So we’ve been talking so far mostly about mistakes, possibly well-intentioned, made by people doing research who seem to have truth in the back of their mind, even if they’re not actually taking the steps that get them to the truth. But where what you’ve done gets a lot more fun and exciting is going after complete frauds, researchers who are actually outright making up data or faking experiments. So, of all these fraud cases that you’ve been part of, the granddaddy of them all is the Francesca Gino case. Do you want to tell me about that?
SIMONSOHN: Sure. A few years ago, maybe five years ago, four years ago, we were approached by young researchers, who wanted to remain anonymous, with concerns. I had been in touch with them a few times about fraud. The first paper I had done detecting fraud was about 14 years ago. And so, I had sort of moved on, and I didn’t want to do it anymore. And so they would approach me and I told them, “Look, unless the person’s very famous or the paper is very influential, I don’t want to get involved.” It’s very draining —
LEVITT: Okay, wait, so you had done some fraud research and then you swore off it. To an outsider, it might seem like, Wow, of all the things that might be really fun and exciting as an academic, it would be revealing some horrible fraud who’s like doing terrible things, and ruining the profession. But you’re saying you didn’t actually enjoy that kind of work?
SIMONSOHN: No, I hated it. I hated it because the first two days are fun and the next year is dreadful.
LEVITT: Because the first two days are the discovery process where you’re actually in the data, you uncover the patterns, and then the rest is the drudgery of being 99.999-percent sure. Because if you’re wrong, you are really in trouble.
SIMONSOHN: It’s not enough to be right in your head with a hundred percent certainty. You need to be certain that others will be certain.
LEVITT: And it’s also a case where you have an adversary, right? Mostly when you do academic work, you write something and nobody really cares that much. But when you are saying someone else made up their data, you have created an enemy who will fight you to the death.
SIMONSOHN: But it’s also draining ’cause you become the world expert in an incredibly trivial, small piece of information. Like, as we’ll talk about in a minute — Francesca Gino — we spent a lot of time on her Excel spreadsheets of data. I’m like the world expert in study three that she ran, you know, 14 years ago. I know every row, and that’s just really useless. We’ve talked about like, is research useful or not useful? It’s debatable, but, like, knowing how a particular spreadsheet was generated, it just feels so local. You’re not learning anything. It’s not fun. You get a lot of pushback.
LEVITT: So you had done some research, you’d found fraudsters, and in part because you had a reputation for doing this, people would come to you with hot tips, with the idea that fraud was going on. So a stranger came to you with the belief that Francesca Gino was cheating on her research. And this is especially interesting ’cause Francesca Gino had been one of your own co-authors in the past, which must have put you in a really interesting and complicated place.
SIMONSOHN: Yep. So we have a paper together. I used to be at Wharton, I’m in Spain now. And we made her an offer when I was there. She ended up going to Harvard instead. So I knew her. And we did have suspicions maybe 10 years ago, and we looked at the data that we had access to. We were subjectively convinced that something was wrong with it, but we didn’t think we could prove it beyond reasonable doubt. And so we dropped it.
LEVITT: So then this young academic came to you, and she had better evidence? She convinced you that you thought you could actually make a case of it?
SIMONSOHN: Yeah. It was two of them and they sent me a report they had written and I thought, This is promising. We talk about red flags versus smoking guns. So a red flag is something where, in your experience with it, it just doesn’t seem right. That gives you like probable cause, so to speak. But it’s not enough to raise an accusation of fraud. That’s a smoking gun. And I think that’s where the report was at that stage. And then we said, “Can we get evidence that it’s sufficiently compelling that anybody looking at it would be immediately convinced?”
LEVITT: Your Data Colada blog team is you and Joe and Leif. How many hours do you think the three of you put in, on top of what these other two folks had done, to try to push it to that stage?
SIMONSOHN: Hundreds.
LEVITT: Yeah. So big, big investment. And it’s not even so obvious why you’re doing it. It probably did, in the end, further your academic career and bolster your reputation. But mostly, this is a task that isn’t rewarded in academics very much.
SIMONSOHN: It’s funny, like, it definitely helped my policy making career. I’m actively engaged in trying to change things in social science. This definitely made it easier to do that. I don’t know that it made it easier publishing my research that is not about fraud. ‘Cause people are happy that whistleblowers exist, but nobody likes whistleblowers. It doesn’t engender, like, warmth, you know? I’m happy you exist, but I’m going to talk to somebody else during my lunch. But for my intentions to try to influence science and to have more credible research, this has been very good. We’ve received funding. We’re about to launch a new platform that seeks to prevent fraud instead of detecting it. And that only was made possible by the attention that this case received.
LEVITT: Okay, so just to foreshadow what’s going to happen. So she is going to be fired from Harvard, her tenure removed, as a result of this evidence you were collecting. So just to give listeners a flavor, what’s the clearest evidence? Of all the things that you found in her data, what made you say, “My God, there’s no way this could happen except through outright fraud”?
SIMONSOHN: My favorite is this: big picture, people were rating how much they liked an event that they participated in, one to seven, how much did you like it? Okay. And she needed people to like it less in one condition than the other. And in fact, that’s what happened. And we proposed that the numbers had been changed. Somebody had said they like it at seven, but in the data they appear as a one. This was a red flag that the junior researchers came to us with. ‘Cause they were looking at the distribution of liking numbers and they said, “Look, there’s just this whole mass of people who are giving all sevens and they entirely disappear.” And that’s true. It’s surprising that a bunch of sevens would move to ones and twos. But what do I know? This is a weird task. I don’t know those people. I don’t know how people use the scale. So it was a red flag. It wasn’t a smoking gun. And so what I told them, I said, “Look, if the numbers were changed, there may be a trail in the data in the other columns, in the columns that weren’t changed. There should be a mismatch.” And so what I was thinking when I said that was something like gender. So you imagine that in general women like this thing more than men, but for those people that were changed, you don’t see that effect for women, or something like that. That’s what I was expecting. But we found like a gold mine, which was, there was a column where people had been asked to describe in words what the event was like. Okay. And so people used words like, That was great. I loved that. Best time of my life. Or, I hated it. I felt yuck — because it was a networking event. I felt disgusting selling myself in this event. And so the idea was, Oh, so maybe if the numbers were changed, there’ll be a disconnect between those words describing the very same event and the numbers summarizing the event. Okay. And so we looked at those suspicious numbers, the ones we would’ve expected to be sevens but that appeared as ones — ones, meaning they hated it. And you look at the words, and the words said, “Best thing ever. Loved it.” Okay. And then you looked at the other side, the people who gave sevens that we thought had been ones, and they said, “I felt disgusted.” Our hypothesis was, Those values were changed. So what we tried to do, and this is why we reached out to Harvard originally, was like, Look, if you go to Qualtrics, which is the platform where the studies are run, so the original data on the server, you can check. We told them, “If you go to row 100 and you go to this column with the numbers, you will see that even though the data she posted has a one, on the server the numbers are actually sevens. Here are the 20 rows you have to check. If we are right, those numbers are sevens. If we are wrong, those numbers are ones.” And we thought, Harvard can check it immediately ’cause they have access to the Qualtrics data — we didn’t — and we thought maybe the following day we would know whether we were right or wrong. ‘Cause once you identify exactly how the data were modified and you have access to the original data, then you can check whether your hypothesis is correct or incorrect.
LEVITT: In the end, all of the original data, which hadn’t been available to you, became available. And a third-party firm was hired to analyze the original data and to compare it to the altered data. And this third party confirmed the conjectures you had made, and they also found other ways that she had altered the data that you hadn’t found.
SIMONSOHN: Yeah, $25 million lawsuit later, information was made public; we realized, Yeah, that was right.
You’re listening to People I (Mostly) Admire. I’m Steve Levitt. After this short break, Uri Simonsohn and I will return to talk about the $25 million lawsuit.
* * *
In 2023, after accusing Francesca Gino of fraud, Uri Simonsohn, his Data Colada colleagues, and Harvard University were sued by Gino for $25 million. I, too, have been frivolously sued by a disgruntled academic. And even when you know the facts are on your side, it still eats up a ton of time, energy, and money. I’m curious how Uri felt while this legal threat was hanging over him.
SIMONSOHN: It was hard for a few weeks, I would say. It was hard in part because funding wise, like even if you’re right, just defending yourself in the American legal system, it’s very expensive.
LEVITT: And I heard that your university wasn’t willing to pay for your defense, which I find infuriating. Is that really true?
SIMONSOHN: No, no. They did end up paying for it. They did. The most generous one was my own school in Spain. But it was difficult because this is unheard of in Spain, and it’s also August. In August, nobody’s working in Spain. Like literally the university’s closed down. And I don’t know who to call when you get sued. No idea who that person is. So I find out the name of the person, I email them and they say, “Is this really important?” And I said, “Well, yes, we need to talk, like, as soon as possible.” But they were great. They were actually very generous. So we did what’s called a motion to dismiss, which is we tell the judge this is ridiculous, and if the judge agrees it’s over. And that costs about a hundred thousand dollars. So the university said, “We’ll pay that.” Now, if the judge disagrees and says, Let’s take it further, let’s go into what’s called discovery, where both parties get each other’s emails and documents and so on, that could be a few hundred thousand dollars more. And none of the schools committed to funding us up to that point. So that was stressful. ‘Cause we could be on the hook for like a million dollars. And it’s not like you’re on the hook for up to a million dollars because you did something wrong, or you made a mistake, or you had an accident. It’s like, you did something and you have that liability. But then, people did a GoFundMe for us.
LEVITT: So, when the academic community heard about this, a GoFundMe project was started, and it raised hundreds and hundreds of thousands of dollars in almost no time at all. And that made me feel really good ’cause it’s a signal of how valuable the profession thinks your work is. That must have made you feel really good, right?
SIMONSOHN: Yeah, it’s the only time I’ve cried for professional reasons. It was an overwhelming feeling. ‘Cause you feel like it’s you against the world. And feeling like the community was supportive that really was amazing for us.
LEVITT: The judge has thrown out all of her claims against you, although the lawsuit with Harvard is still ongoing. Now, the thing that’s so crazy about this, which is almost like a bad Hollywood movie, is that as you’re researching fraud by Francesca Gino, you end up stumbling onto, in the same paper but a different part of the paper, apparent fraud by this leading behavioral scientist, Dan Ariely. How did that come to pass?
SIMONSOHN: We’re chatting with these younger researchers and we’re looking for smoking guns. We’re looking at this file and they show us very blunt evidence of fraud. And so this other study involved car insurance and people self-reporting to the insurer how much they drove, and that influences the premium they pay on their policy. At the end of the year, you would write down the odometer reading in your car, and they would compute how much you drove and then adjust the policy. And what this paper was showing is that you can get people to be more honest by having them sign before they enter the odometer reading instead of after doing so. And so the data was posted, and these younger researchers noticed and showed us, Look, the distribution of miles driven is perfectly uniform between zero and 50,000, meaning just as many people drove a thousand miles as 3,000, 7,000, 50,000, etc. Every single number of miles equally likely, equally frequent, but there’s not a single person who drove 51,000 miles. Anybody who’s looked at data will immediately know that’s impossible. And a lot of people who have never looked at data but have driven a car will realize that doesn’t make sense, right?
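As an illustration of the kind of distributional check involved, and not the analysis the Data Colada team or anyone else actually ran, here is a sketch that compares perfectly uniform mileage with a hard cutoff against a more plausible skewed distribution. The cutoff, the bin count, and the lognormal stand-in for real driving are all assumptions.

```python
import numpy as np

def mileage_red_flags(miles, cap=50_000, bins=25):
    """Two quick, illustrative forensic checks on reported miles driven."""
    miles = np.asarray(miles, dtype=float)
    counts, _ = np.histogram(miles[miles <= cap], bins=bins, range=(0, cap))
    flatness = counts.std() / counts.mean()  # near zero => suspiciously flat histogram
    above_cap = int((miles > cap).sum())     # a hard ceiling at exactly 50,000 is odd
    return {"bin_count_variation": round(float(flatness), 3), "drivers_above_cap": above_cap}

rng = np.random.default_rng(3)
fabricated = rng.uniform(0, 50_000, size=10_000)             # flat, with a hard cutoff
plausible = rng.lognormal(mean=9.3, sigma=0.7, size=10_000)  # skewed, with a long right tail
print(mileage_red_flags(fabricated))  # tiny variation across bins, nobody above the cap
print(mileage_red_flags(plausible))   # bins vary a lot, and some drivers exceed 50,000
```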
LEVITT: Yeah. And so you probably presume that Gino had cheated on that too, right? Because it’s her paper. She was a co-author.
SIMONSOHN: We did. We did. That would’ve been the first smoking gun on the Gino case. That would’ve been. But I said, if my memory doesn’t fail me, I said, “This feels too clumsy.” It’s not like the Gino stuff is brilliantly done, but this feels like worse than that. It just doesn’t feel like the other studies we were looking at. Like, I was getting a flavor for the fingerprint of what her data looks like: something funny happens in the extremes of the distribution. But this uniform business, it’s just different. And so we said, “Well, let’s see who created the file.” And we saw Dan Ariely’s name there. And that was the first time we really ever thought of Dan as possibly being involved in funny business. So we contacted them and immediately Dan said, “No, if anything went wrong, none of the authors would be responsible for it. Only I would be responsible for it.” So he immediately took ownership of that. We had a blog post on that, and that drew a lot of attention. But then, there were no other public data sets. And so, our view at the time, and I don’t think I’ve talked to people about this before — our view at the time was, Okay, who can get to the bottom of it? And we thought that with the insurance data, only an investigative reporter could. Somebody needed to go talk to the insurance company, was our thought.
LEVITT: Because they’re the ones who had provided the original version of the data, which was later altered to no longer look like the original data. And you needed that comparison.
SIMONSOHN: That’s right. And we thought only an investigative journalist would get that. And that was actually true. A reporter for The New Yorker spent a considerable time, and he was able to get the data and he was able to find, in my mind, irrefutable proof that the data were altered after they were sent to Dan.
LEVITT: What I think is really strange and troubling is that the investigations of potentially crooked researchers fall on the institutions that they work for, and those institutions have such strong incentives not to find them guilty. In stark contrast to the Gino case, where she’s lost her tenure and Harvard’s been very public about it, Duke has taken a very different stance. An investigation was done. It was done extremely secretively. Duke hasn’t talked about it at all, which is interesting to me because it let Dan Ariely himself be the voice describing what the outcome of the investigation was. And he’s no longer a named professor, which is some form of punishment, but obviously a much less severe punishment than losing tenure. But I don’t know, it seems to me like a failure of the institutions to police themselves.
SIMONSOHN: Yeah, so a couple thoughts on that. So one is, the most common outcome is secret. A secret resolution, an agreement where the university says, “If you leave, we’ll give you this much money, you stay quiet, and we’re all happy.” And they just say, “We don’t comment on labor issues,” or whatever the term is for employment decisions at the university. But it’s worth keeping in mind, comparing Dan to Francesca, that it’s unlikely that Duke was able to get data of the caliber that Harvard was able to get, just because of the nature of Dan’s experiment versus Francesca Gino’s. So I’m convinced — there’s no room for doubt — that the insurance data is fraudulent. And I don’t know of a plausible alternative explanation to Dan having done it, but it hasn’t been proven. That’s just my belief based on all the evidence that’s available. So it’s not just because it’s a man or a woman, or more famous, less famous, or Duke versus Harvard. The two cases aren’t matched on the strength of the evidence of wrongdoing.
LEVITT: Do you think that these high-profile fraud cases will have or have had a big deterrent effect scaring off others from cheating? If I were a cheater, I would be very afraid of you. But on the other hand, when punishments that get handed out are so uneven, then it really says, “Well, look, I might get caught. It might be embarrassing, but might not end my career, so I can do it.” What’s your feeling about the deterrent effect that what you’re doing is having?
SIMONSOHN: I don’t know the facts, but I can tell you, like, rationally, you shouldn’t be less likely to commit fraud after this experience, because what you will learn is, there’s no real punishment. ‘Cause if the worst thing that can happen to you is that you’re fired, but without fraud you would have been fired anyway, it’s still a win. For somebody like that to commit fraud, there’s no real disincentive. ‘Cause the worst that can happen is that they don’t do it anymore. And so that’s why I think the rest of us have to take action to prevent it, like to not be complicit in making it so easy for them.
LEVITT: You mentioned that you’d received funding for a platform that prevents fraud. Could you tell me more about that?
SIMONSOHN: So it’s called AsCollected. It’s a spinoff, so to speak, of our website for pre-registration, which is called AsPredicted. And the idea is some version of a data receipt. So if you go to a conference and you buy lunch somewhere, and you want to be reimbursed, in most cases you don’t just tell the university I spent $7, you need to show them a receipt. But then if you tell them, I collected data, they don’t ask you for a receipt. And so the idea is to provide a written record of where the results come from. And that’s a combination of how the data were obtained and how the data were analyzed. So the first question would be, Is your data public or private? You would say, It’s private. And you would say, Can you name the company that gave you the data? You would say yes or no. If you say, No, it asks you, Do you have a contract that prevents you from disclosing who they are? And if you say, Yes, it asks you, Who in your institution signed the contract? And then it asks you, How did you receive the data? And you say something like, I received an email on such-and-such date with the spreadsheet that I analyzed. You indicate who received the data, who cleaned it, and who analyzed it. And the final output is two tables. A table with the when, what, and how. And a table with the who. We have experience with about 15 different cases of fraud. All of these cases would’ve been so much harder to pull off if you had to answer these simple questions, because now you have nowhere to hide. So the deliverable is a URL. You have a unique URL that has those two tables, and the idea is that journals hopefully will just require you at submission to include the URL. We think our customers here are deans, journals, granting agencies. They want to take care of fraud, but they want somebody else to take care of fraud. And so we’re telling them, Look, all you have to do is ask for the URL and you’ve done your part.
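For a concrete picture, here is a hypothetical sketch of what such a “data receipt” record could contain, loosely following the questions Simonsohn lists. The field names and structure are invented for illustration; they are not AsCollected’s actual format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataReceipt:
    """Hypothetical data receipt; field names are illustrative, not AsCollected's."""
    # Table 1: the when, what, and how
    data_is_public: bool
    provider_named: bool               # can you name the company that supplied the data?
    provider: Optional[str]            # None if a contract prevents disclosure
    contract_signed_by: Optional[str]  # who at your institution signed that contract?
    received_how: str                  # e.g. "spreadsheet received by email on <date>"
    # Table 2: the who
    received_by: str
    cleaned_by: str
    analyzed_by: str

receipt = DataReceipt(
    data_is_public=False, provider_named=False, provider=None,
    contract_signed_by="university legal office",
    received_how="CSV spreadsheet received by email from the partner firm",
    received_by="principal investigator", cleaned_by="research assistant",
    analyzed_by="principal investigator",
)
print(receipt)
```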
LEVITT: So the forces that are pushing people to cheat, either in small ways or in really outright ways, are related to the really strong incentives that exist within academics and the high hurdle for tenure. Do you think that the academic tenure system is broken? Or do you think it’s just a pretty good system with strong incentives, where the benefits are that strong incentives make people work hard and try to be creative, and the costs are that, in extreme cases, people respond to incentives too strongly and in the wrong ways?
SIMONSOHN: Many people blame the incentives for fraud and for P-hacking, and for all these practices people adopt that lead to bad-quality outcomes. I don’t so much blame that. I think it makes sense for us to prefer researchers who publish over those who don’t; those who come out with interesting results over those who don’t. That’s a bit of a minority view. I’m okay with rewarding the person who’s successful over the one who’s not. But the part of the incentives that I think is broken is the lack of a penalty for low-quality inputs. And part of the reason for that is that it’s so hard for the reviewers to really evaluate the work. One way to think about the whole movement towards transparency is that it makes it easier to align the incentives. So given that we reward good outcomes, it’s very important to make sure the inputs are legit. And for people who are just doing this voluntarily to do that, it needs to be easy for them. It needs to be easy for them to know if there was P-hacking. It needs to be easy for them to know if you made a mistake. And that requires transparency.
Despite losing her tenure at Harvard, Francesca Gino maintains her innocence. After Duke conducted its investigation into Dan Ariely, Ariely wrote a response that Duke approved. In it, he said that, quote, “Duke’s administration looked thoroughly at my work and found no evidence to support claims that I falsified the data or knowingly used falsified data in general.” I’ve been a long-time advocate for making data analysis a core topic in K-12 education. My goal isn’t to turn everyone into a data scientist; it’s to equip the next generation to be thoughtful consumers of data analysis. Uri Simonsohn is providing an incredibly valuable service, debunking individual studies and developing strategies and tools for rooting out bad data analysis. But there’s only one Uri. We would need thousands of Uris to keep up with the flow of misleading studies. Everyone needs to be able to protect themselves by knowing enough about data analysis to be an informed consumer and citizen. Meanwhile, if you’d like to hear more about the problem of fraud in academic research and the steps that some people are taking to fight it, check out episodes 572 and 573 of Freakonomics Radio, which you can find in your podcast app.
LEVITT: This is the point in the show where I welcome on my producer Morgan to tackle a listener question.
LEVEY: Hi, Steve. So in our last new episode we had an interview with climate scientist Kate Marvel, and at the end of the episode, you polled our listeners. You wanted to know whether people were optimistic or pessimistic about our future climate, 50 years in the future, so 50 years from today. You wanted to know, A, whether they were optimistic or pessimistic; B, their age; and C, their country. And you have tallied up the responses.
LEVITT: I have. So as usual, we got a very enthusiastic response from our listeners. So let me start with the most basic question, Morgan. What share of respondents would you say were optimistic?
LEVEY: I am going to go against my gut, which is never a good idea, but I’m going to say 65 percent were optimistic.
LEVITT: Alright, so the answer was 42.6 percent. When you read the responses, you just realize what a terrible question it was, because nobody really knew how to answer it and there were a fair number of wafflers. I left those out of that calculation. So about 15 percent of the people clearly waffled. They didn’t want to take a stance. But what was interesting, and it really was what prompted the question in the first place, is that the kinds of logic and arguments and data that people sent were pretty similar for the optimistic and the pessimistic. It’s just a really hard forecasting problem, and I think for a lot of people, trying to make this black-and-white comparison between pessimistic and optimistic was just a really hard challenge.
LEVEY: So do you mean that people who are optimistic or pessimistic were pointing to the same information and then just coming away with different opinions about it?
LEVITT: Yeah. I would say the responses were remarkably thoughtful, and the people who were pessimistic gave really good arguments about why they should be pessimistic. And I think they were the kind of arguments that optimists wouldn’t disagree with, and the same with the optimists. At some basic level, it’s probably just not that clear whether you should be optimistic or pessimistic.
LEVITT: Okay. So that’s the one piece of data that we collected that is really legitimate. Now, what we’re going to do next is we’re going to do data analysis the way psychologists did it 20 years ago. It’s exactly Uri Simonsohn’s point in that paper about “When I’m Sixty-Four.” I built in lots of degrees of freedom in my survey. ‘Cause I know how old people are, I know what country they’re from. I can deduce their gender based on their name. But then there are also a lot of subtle dimensions like, Did they respond within the first 24 or 48 hours? Did they respond in the morning or the nighttime? So in this kind of setting, you have almost infinite possibilities to try to create something interesting when there’s nothing interesting. And I really want to highlight that because if you’re a passive listener, even after this episode that just emphasized how people with data can kind of trick you, I think there’s a good chance I could have tricked you by talking about what we’re going to do next like it’s science, when really there’s nothing scientific about it at all. It’s just a way to try to have fun with data.
LEVEY: Okay, so what was the first lever?
LEVITT: The first lever is age. Okay. And actually, as I suspected when I did the survey, the data about the demographics to me turned out to be more interesting than the answers about optimism and pessimism. So what do you think the median age was among our respondents?
LEVEY: 46.
LEVITT: Not bad — 42. I was expecting younger. Okay. So then let’s tackle the question. If you divide our sample into the people who are younger than the median age, so younger than 42, versus older than 42, which group do you think came back as more optimistic about the future of the climate?
LEVEY: I think the younger people were more optimistic and older were more pessimistic.
LEVITT: So that is true; 47 percent of the younger people were optimistic versus 39 percent of the older people. Now, that is not statistically significant. None of the things I’m about to tell you related to pessimism or optimism turn out to be significant. This was actually one of those rare cases where even when I tried to cut the data in a bunch of different ways, I could not find a single cut that was statistically significant. So there was really a whole lot of nothing going on in this data. I couldn’t even do gender because, as usual, we have this incredible gender skew in the data. So this time, 84 percent of the respondents were men, based on my analysis of their names. Given that, it turns out men were slightly more optimistic, but again, not statistically significant at all. Okay. So geography is the last one I want to talk about. What share of respondents do you think are from the United States?
LEVEY: 75 percent.
LEVITT: Yeah, so I would’ve expected two-thirds, because two-thirds of our downloads are in the United States, but only 49 percent of the respondents were American. In particular, the thing that was completely and totally crazy is this: Canada represents about 7.5 percent of our downloads, and over 20 percent of our responses were from Canada. Which is just really interesting. Just to put it in perspective, 40 percent of the women who responded were Canadian. If it weren’t for the Canadian women, we would’ve hardly had any women at all. Now, the Canadians didn’t break as either particularly pessimistic or optimistic. It was just really interesting that they were engaged. The Australians were the same thing. The Australians were about three times as likely to respond, relative to their share of downloads.
LEVEY: That’s not surprising. We have a very active Australian listener base.
LEVITT: Yeah, that’s absolutely true. So the only two things for which I found statistical significance in the entire dataset were that the Canadians and the Australians were very fervent responders.
LEVEY: Was there another lever you pulled?
LEVITT: Well, so I did what Uri talked about, which is I looked at all of the cross tabs. Okay, what about foreign women, or older Americans? And none of those showed anything at all. Honestly, I kind of ran out of steam after a while trying to look at all of the different levers, because the stakes were low. If I had actually done this as an experiment, invested lots of time, and I were a psychologist 20 years ago, I probably would’ve put a lot more effort into cutting the data all the different ways, because I would’ve been looking at an important publication. Whereas here I’m just trying to fill a couple minutes on a podcast.
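A small simulation sketch of the subgroup scanning Levitt describes, with illustrative assumptions throughout: five made-up binary traits, 500 respondents, and a chi-squared test on every single trait and every pairwise cross-tab. Even though optimism is unrelated to everything by construction, scanning enough cuts turns up a “significant” pattern far more often than one time in 20.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def any_subgroup_significant(n=500, n_traits=5, base_rate=0.43):
    """One fake survey: optimism is unrelated to every trait by construction,
    then every single trait and every pairwise cross-tab gets its own test."""
    optimist = rng.random(n) < base_rate
    traits = rng.random((n_traits, n)) < 0.5
    masks = [traits[i] for i in range(n_traits)]
    masks += [traits[i] & traits[j]
              for i in range(n_traits) for j in range(i + 1, n_traits)]
    for mask in masks:
        table = np.array([[(mask & optimist).sum(), (mask & ~optimist).sum()],
                          [(~mask & optimist).sum(), (~mask & ~optimist).sum()]])
        if table.min() > 0:  # chi-squared needs non-empty cells
            _, p, _, _ = stats.chi2_contingency(table)
            if p < 0.05:
                return True
    return False

hit_rate = np.mean([any_subgroup_significant() for _ in range(2_000)])
print(f"Share of null surveys with at least one 'significant' subgroup: {hit_rate:.1%}")
```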
LEVEY: Listeners, if you have a question for us, if you have a problem that could use an economic solution, if you have a guest idea for us, our email is PIMA@Freakonomics.com. That’s P-I-M-A@Freakonomics.com. We read every email that’s sent and we look forward to reading yours.
In two weeks, we’re back with a brand new episode featuring Nobel Prize winning astrophysicist Adam Riess. His research is challenging the most basic ideas we have about the universe. As always, thanks for listening, and we’ll see you back soon.
* * *
People I (Mostly) Admire is part of the Freakonomics Radio Network, which also includes Freakonomics Radio and The Economics of Everyday Things. All our shows are produced by Stitcher and Renbud Radio. This episode was produced by Morgan Levey, and mixed by Jasmin Klinger. We had research assistance from Daniel Moritz-Rabson. Our theme music was composed by Luis Guerra. We can be reached at PIMA@Freakonomics.com, that’s P-I-M-A@Freakonomics.com. Thanks for listening.
LEVITT: It’s funny, I forget how old I am.
Sources
- Uri Simonsohn, professor of behavioral science at Esade Business School.
Resources
- “Gino v. President and Fellows of Harvard College,” (Court Listener, 2025).
- “Statement from Dan Ariely,” (2024).
- “Data Falsificada (Part 4): ‘Forgetting The Words,’” by Uri Simonsohn, Leif Nelson, and Joe Simmons (Data Colada, 2023).
- “They Studied Dishonesty. Was Their Work a Lie?” by Gideon Lewis-Kraus (The New Yorker, 2023).
- “Evidence of Fraud in an Influential Field Experiment About Dishonesty,” by Uri Simonsohn, Leif Nelson, and Joe Simmons (Data Colada, 2023).
- “Signing at the beginning makes ethics salient and decreases dishonest self-reports in comparison to signing at the end,” by Lisa Shu, Nina Mazar, Francesca Gino, Dan Ariely, and Max Bazerman (PNAS, 2021).
- “Power Posing: Reassessing The Evidence Behind The Most Popular TED Talk,” by Uri Simonsohn and Joe Simmons (Data Colada, 2015).
- “Your Body Language May Shape Who You Are,” by Amy Cuddy (TED, 2012).
- “Daily Horizons: Evidence of Narrow Bracketing in Judgment from 10 Years of MBA-Admission Interviews,” by Uri Simonsohn and Francesca Gino (Psychological Science, 2012).
- “Spurious? Name similarity effects (implicit egotism) in marriage, job, and moving decisions,” by Uri Simonsohn (Journal of Personality and Social Psychology, 2011).
- “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” by Joe Simmons, Leif Nelson, and Uri Simonsohn (Psychological Science, 2011).
Extras
- “Will We Solve the Climate Problem?” by People I (Mostly) Admire (2025).
- “Why Is There So Much Fraud in Academia?” by Freakonomics Radio (2024).
- “When I’m Sixty-Four,” by The Beatles (1967).