
Episode Transcript

It’s already been a decade since I.B.M.’s supercomputer Watson beat two Jeopardy! champions.

Alex TREBEK: Now we come to Watson. Who is Bram Stoker? And the wager? Hello, $17,973 and a two-day total of $77,147!

Then and now, a lot of people have wondered: when will computers get smart enough to replace us? And if so, at what cost?

NARRATOR: Enter the computer, and a new age.

*          *          *

From the Freakonomics Radio Network, this is Freakonomics, M.D.

*          *          *

I’m Bapu Jena. I’m a medical doctor and an economist. Each episode, I dissect an interesting question at the sweet spot between health and economics.

Today on the show: it was always a matter of time before technological advances in computing made their way into medicine. The falling price of data storage and increases in computing speed laid the foundation for an explosion in healthcare data. And now, artificial intelligence, or A.I., promises to make that data useful. But are we actually at a place where A.I. can help doctors treat patients?

We throw around the term artificial intelligence a lot. But what does it really mean?

OBERMEYER: Artificial intelligence is fundamentally about pattern recognition.

Ziad Obermeyer is a professor at U.C. Berkeley and an emergency-room doctor.

OBERMEYER: We’re often interested in predicting something about the world: what’s going to happen, what’s happening now. And that’s exactly what Amazon does when they’re showing you a product. They’ve made a prediction based on all of the data that they see about you that you are likely to buy this product. And so, all of the techniques are the same in health. We’re just using the vast troves of data that we have access to in electronic medical records, in imaging databases, and all sorts of other things to make a prediction about what’s going to happen to you.

A.I. is a pretty broad term. It’s often used interchangeably with machine learning, which is the more technical name for the process that Ziad just described: extracting information from huge troves of data and predicting things we may care about using sophisticated algorithms. Ultimately, in medicine, the goal of A.I. is to help doctors make better decisions when it comes to the two problems that they face: making the right diagnosis and picking the right treatment. Ziad understands these problems firsthand.

OBERMEYER: If you are working in healthcare and you’re paying attention, you’re always noticing how things go wrong, how you’re forced to answer these really complex questions in very short periods of time. And I think that being very, very stressed and preoccupied by my mistakes was what led me into the research that I’m doing now, which is how artificial intelligence might be able to help doctors make better decisions.

E.R. doctors, like many healthcare professionals, are under a lot of stress, which research shows can lead to more errors.

OBERMEYER: In the E.R., you’re seeing about 20 or 30 patients in a 10-hour shift. You’ve got, you know, a few minutes to talk to the patient, to review all of their medical history. There’s just this huge dump of information that is in the record, that you can get from the patient by talking to her or by examining her. And then there’s all the information you can get from tests, but then you’ve got to figure out which tests to order and then what to do with them.

Gurpreet DHALIWAL: There’s lots of different perspectives and even philosophies on the diagnostic process. But a simple one is that it’s a categorization task.

That’s Gurpreet Dhaliwal. Gurpreet is an expert diagnostician and he’s helped us untangle a couple of medical mysteries on this show before. He specializes in how doctors think and come to conclusions about their patients’ conditions.

DHALIWAL: So, a patient has a concern or a set of symptoms that brings them into healthcare. The doctor listens to those concerns and symptoms, but then also starts adding on other data: things we find from the physical examination, blood tests and labs that we run, x-rays, and other imaging studies. And we put all of them together, sort of in an argument. We make an argument that says it’s most likely that this patient’s constellation of data is explained by pneumonia or a heart attack or appendicitis.

Appendicitis actually provides a good example of how doctors don’t always get these diagnostic decisions right … even when they try to diagnose themselves. Ziad himself once walked around with appendicitis — not recognizing the pain for what it was until his appendix burst.

OBERMEYER: It was really humbling because there’s a shortlist of things that as an emergency doctor, you’re really not supposed to miss. And this is one of them, and I’d missed it in myself.

Overall, Gurpreet says roughly 10 percent of the time, doctors get a diagnosis wrong or maybe just don’t get it right away.

DHALIWAL: Sometimes with minor consequences but sometimes with significant ones. Given the complexity of what a physician and healthcare providers do, you know, getting it right nine out of every 10 times is a pretty good rate. But that residual 10 percent is our motivation. That one-in-10 patient matters a lot, and to the doctor too.

Doctors get diagnoses wrong for a lot of different reasons, but one of the reasons is that they just don’t get the right data. They may not ask a patient a critical question or they may not think to order an important lab test. In a new paper, Gurpreet and his co-authors found that when trying to diagnose infectious diseases, doctors ordered the wrong lab tests or didn’t seek consults with specialists early enough, which delayed diagnoses. The sheer number of things a doctor has to consider to arrive at a diagnosis is also an issue.

DHALIWAL: There is definitely decision fatigue. In fact, there’s a phrase that’s applied to medical practice called decision density, which is the number of decisions you have to make per unit time, like in an hour. And in some settings, like the emergency room, it’s really massive.

Of course, one of the things that makes medicine challenging, perhaps unlike lots of other fields, is that when a mistake happens, we don’t always know it.

OBERMEYER: We never see the diagnoses that don’t get made.

So, Ziad wondered, if he could figure out a way to find diagnoses that didn’t get made, could he design an algorithm that would help doctors not make that mistake?

Turns out that he could. Ziad and his co-author, Sendhil Mullainathan, an economist at the University of Chicago, used electronic health data from a big, top-ranked hospital. They analyzed roughly 250,000 emergency-room visits from more than 100,000 patients. So, a lot of data. Data that included information about each visit, and what was done to patients. And then, they brought in A.I.: a software program they built to make a prediction about whether or not a patient was likely to have a heart attack while at the E.R.

OBERMEYER: It’s one of the most common causes of death and disability in the country and in the world, but it’s really hard to diagnose.

When a patient walks into an E.R. complaining of chest pain, it’s not always clear it’s because of a heart attack. And testing for proof of a heart attack is costly and sometimes invasive. So, that was also why Ziad wanted to focus on heart attacks. They wanted to figure out whether the algorithm could predict, any better than doctors, which patients had blockages in the blood vessels that supply the heart.

OBERMEYER: And we can then look to see, “Okay, well, who was right, the doctor or the algorithm?”

Ziad’s analysis uncovered two kinds of mistakes that doctors made.

OBERMEYER: The first one probably won’t be a surprise to anyone who is a regular listener to this podcast, which is that doctors do way too many tests on low-risk patients. So, for a huge set of patients, about two-thirds of all of the patients that doctors choose to test today, the algorithm says, “Don’t test this person. It’s definitely not going to be positive.” And lo and behold, when the doctors test them, two-thirds of them should not have been tested based on a, you know, very standard cost-benefit analysis with those tests.

And the other mistake?

OBERMEYER: Now the second finding is perhaps more surprising, which is that there are a bunch of patients that the algorithm says, “Oh, this person is really high risk. You should definitely test them.” And doctors don’t test them.

It might otherwise be really hard to know what ultimately happened to those people — the ones that didn’t get their hearts tested. Were they harmed by not getting tested? Well, Ziad and his coauthor figured out a way around that.

OBERMEYER: The solution is we’ve got longitudinal data on these people.

Meaning, the researchers could see what happened to them after their E.R. visit.

OBERMEYER: And it turns out that those people have adverse events that look like a missed heart attack at very high rates, like, much higher than the clinical guidelines would have suggested testing at. And so, we’re able to use those records in combination with the algorithm’s predictions to find patients that doctors are missing, that look like they had a heart attack that was never diagnosed.

So, the computers won — this time, at least. Now, Ziad’s analysis used data from one hospital. It was a big hospital, but still, it was just one. So, they ran the same analysis using national Medicare data.

OBERMEYER: And that was in the, I think, millions of patients visiting E.R.s across the country. And we found very similar results.

Turns out, medical testing is a great problem for A.I. to help solve.

OBERMEYER: The doctor’s job is to make a prediction on someone’s heart attack risk and test people who are above a threshold of risk, because those are the people that are likely to benefit from that test and the treatments that come from a positive test. And that’s exactly what the algorithm helps with: that testing decision, by eliminating testing of people below that cost-benefit threshold and increasing testing of the people above that threshold. It puts the focus on the fact that these decisions are incredibly hard, and it’s not surprising that people make mistakes.
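The testing decision Ziad describes boils down to a simple rule: order the test only when predicted risk clears a cost-benefit threshold. Here is a minimal sketch of that logic in Python; the risk scores and the 10-percent threshold are invented for illustration, not taken from the study.

```python
# Hedged sketch of a threshold-based testing rule. The risk scores and
# the threshold below are hypothetical, for illustration only.

def should_test(predicted_risk: float, threshold: float = 0.10) -> bool:
    """Order the test only when the model's predicted heart-attack risk
    clears the cost-benefit threshold."""
    return predicted_risk >= threshold

# Hypothetical E.R. patients with model-predicted risks:
patients = {"A": 0.02, "B": 0.35, "C": 0.08, "D": 0.60}

to_test = [name for name, risk in patients.items() if should_test(risk)]
print(to_test)  # ['B', 'D']
```

The algorithm’s two corrections map directly onto this rule: patients like “A” and “C” (below threshold) were being over-tested, while some patients like “D” (well above threshold) were being missed.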

So, Ziad’s paper, which is coming out soon in The Quarterly Journal of Economics, shows that A.I. may be better than doctors at predicting who’s at risk of having a heart attack. Better, at least in these high-stakes, E.R. cases where doctors have to make diagnoses and order tests fast. But how else can A.I. and high-end computing help doctors? And: is this kind of technology really all it’s cracked up to be? That’s coming up on Freakonomics, M.D., right after this.

*          *          *

Machine learning and artificial intelligence aren’t a panacea for the diagnostic errors doctors make — at least not yet. And that’s because we need to learn how to teach these machines the right lessons. Again, here’s Ziad Obermeyer, the E.R. doctor and A.I. researcher.

OBERMEYER: We might not want to unthinkingly automate what a doctor thinks about a patient. We want algorithms to learn about what actually happens to the patient. And so that’s the thing that I think is really, really promising: training algorithms to learn from patients, not from doctors, to look at the experiences that a patient reports, to look at the outcomes that that patient goes on to have, and to train an algorithm to learn about those things as a way of helping the doctor do more than she currently can today, not just spitting back out to the doctor what another doctor already knows.

Because spitting back what a doctor may already know can have big consequences.

OBERMEYER: So, when algorithms learn from data that come out of this very biased healthcare system, they can also learn those biases and then reinforce them and perpetuate them.

What kind of biases might these algorithms pick up? Well, let’s look at an example from a recent paper by Ziad and his collaborators. It came out in the journal Science in 2019. They started with the fact that some people just have really complex health needs. And inevitably, these patients will suffer from problems related to their chronic conditions that could have been prevented. For instance:

OBERMEYER: Their diabetes could lead them to need a toe amputation.

These conditions, when they’re really severe, typically affect a small group of people; but they’re expensive. That means many insurance companies are incentivized to identify high-cost patients before they face those preventable complications.

OBERMEYER: And in so doing, both help the patient with their health but also save the healthcare system a lot of money. So, when this goes well, everybody wins and it’s great.

That’s where the algorithms come in.

OBERMEYER: What these algorithms do is something that you’d think algorithms would be very good at doing, which is looking into the future and predicting who’s going to need a lot of health care.

These algorithms are used a lot in the insurance business and in healthcare systems. So, Ziad and his colleagues wanted to poke into them a little bit more. They looked at a commonly used algorithm developed by a company. Algorithms like it affect around 200 million people a year. When they looked at which patients the algorithms selected, they found something surprising — and unsettling.

OBERMEYER: What we found was that these algorithms, and in particular the one that we studied, had an enormous amount of racial bias. Effectively this algorithm was helping healthier white patients cut in line ahead of sicker Black patients for access to this extra help with their chronic health needs.

How did the researchers figure this out?

OBERMEYER: If you looked at patients who had the same health needs, the Black patient on average was given a lower score than the white patient and so was prioritized further back in line than the white patient.

The researchers found that by adjusting the algorithm, they could increase the percentage of Black patients getting extra help from almost 20 percent, to nearly 50 percent.

OBERMEYER: It’s a huge amount of bias. Like, had this algorithm been unbiased, we would have more than doubled the fraction of Black patients in that high-priority group that was fast-tracked into these extra-help programs. So, the scale was enormous, like tens of millions of patients every year. And the magnitude of the bias was also very large. So, yeah, this was a huge problem.

So, how did the algorithm create this bias to begin with?

OBERMEYER: The algorithms were trained to predict how much healthcare costs a given patient was going to generate. But that is not a great proxy for someone’s healthcare needs. So, they’re correlated. And that’s because sick people generally go to the hospital, where they generate costs. And that happens for Black patients and white patients alike. But the problem is that they don’t generate costs at the same rate. And that’s because Black patients on average face more barriers to access. There’s a lot of structural bias in the system that prevents people from ever making it into the healthcare system to begin with. And then there’s also racism in medicine itself. So, there are many studies that show that doctors making decisions about two otherwise identical patients will tend to recommend less care for the Black patient. And so those things come together to mean that Black patients who have the same health as white patients are going to have lower costs because of those barriers to access and racism.

What Ziad is saying is a little complicated, so let me unpack it a bit. Because of barriers to medical care, Black patients who are just as ill as white patients tend to get less medical care. Because of that, Black patients “look” to the insurer like they are less sick. They see the doctor less; they get fewer prescriptions. So, when an insurer is trying to figure out who its sickest patients are that need interventions to keep them out of the hospital or prevent disease complications, the insurer will tend to miss a lot of Black patients who don’t appear as sick as they really are, because they didn’t use medical care that — turns out — they may not have had good access to. This ultimately boils down to a problem with the data. The ways in which insurers try to figure out who’s really sick and who isn’t are plagued by bias in who has access to care and who doesn’t.
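This proxy problem can be made concrete with a toy simulation: if observed healthcare costs understate true need for one group because of access barriers, ranking patients by cost will under-select that group even though ranking by true need would not. Everything here (the access factor, the population sizes, the 10-percent cutoff) is invented for illustration; it is not the model or the data from the Science paper.

```python
import random

random.seed(0)

def simulate_patient(group: str) -> dict:
    """True 'need' is what we care about; observed 'cost' understates it
    for the group facing access barriers. All parameters are invented."""
    need = random.uniform(0.0, 1.0)
    access = 1.0 if group == "A" else 0.6  # assumed barrier to care for group B
    cost = need * access                   # cost only reflects care actually received
    return {"group": group, "need": need, "cost": cost}

population = [simulate_patient(g) for g in ("A", "B") * 5000]

# Select the top 10 percent for an extra-help program, two ways:
k = len(population) // 10
by_cost = sorted(population, key=lambda p: p["cost"], reverse=True)[:k]
by_need = sorted(population, key=lambda p: p["need"], reverse=True)[:k]

def share_group_b(selected) -> float:
    """Fraction of the selected patients who belong to group B."""
    return sum(p["group"] == "B" for p in selected) / len(selected)

print(f"group-B share when ranked by cost: {share_group_b(by_cost):.0%}")
print(f"group-B share when ranked by need: {share_group_b(by_need):.0%}")
```

Ranking by need selects the two groups roughly evenly, while ranking by cost (the flawed proxy) almost entirely shuts group B out — the same mechanism, in miniature, that the researchers found in the commercial algorithm.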

DHALIWAL: The system is only as good as the data we put into it.

That, again, is Gurpreet Dhaliwal, professor at U.C. San Francisco.

DHALIWAL: It’s no different than us in life. It’s our experiences that inform what we know. And so, if the human mislabels, either because of an error or bias on their part, then the computer learns it. So, we worry sometimes about “garbage in, garbage out” in artificial intelligence. But there’s also “bias in, bias out,” and I think that’s something we’re all concerned about.

So, technology is only as good as the humans that make it. But of course, we humans are flawed, sometimes deeply flawed, and if we’re not careful, machines will learn, replicate, and perpetuate those flaws.

But what if we are careful? What if we could specifically design technology to help us overcome the biases that we have? Ziad and a group of researchers have actually created an algorithm to try to ensure Black and white patients receive the same quality of care.

OBERMEYER: If you look at the population level at Black patients versus white patients, at just the levels of pain that they experience on a day-to-day basis, it’s just shocking how much more pain Black patients experience than white patients.

Ziad and his team focused on the treatment of one specific type of pain — achy knees.

OBERMEYER: It’s not only that Black patients have more arthritis in the knee. They do, but when you take that into account, and you just look at people whose x-rays look the same to a radiologist, Black patients still report much more pain than white patients. So, there’s this mystery that, like, a radiologist has looked at the knee x-rays. These knees look the same. And yet, the Black patient is far more likely to report a lot of pain in that knee than the white patient.

So, why might this be? Why would a Black patient report more knee pain than a white patient for what appears, to a radiologist, to be similar-looking knees?

OBERMEYER: Well, there’s a lot of actually very good research on how stress and mental illness and just the burdens of everyday life can actually lead to higher reports of pain in the knee and elsewhere. And those things are more common in Black patients than white patients. And so, a lot of the explanations have focused on these external factors as an explanation for what we think of as the pain gap, the gap in pain between Black and white patients.

What Ziad is saying is that even in two individuals with knees that look similarly damaged on an x-ray, Black patients may report more pain than white patients for reasons that are important but unrelated to the actual pathology in the knee.

But what if that wasn’t the whole story? What if those knees that looked similar to the radiologist’s eye, weren’t the same? What if there were differences in those x-ray images that only a machine could pick up on?

Ziad and his colleagues wondered about this possibility and so they trained an algorithm to look at those same x-rays of people’s knees and make a prediction about who would have pain.

OBERMEYER: Not a prediction about what the doctor would say about that patient’s knee, rather a prediction on what the patient would say about their knee. So, we just trained an algorithm to look at an x-ray and predict: does this look like a painful knee or not?

What did they find?

OBERMEYER: That algorithm — not only did it do a better job of explaining pain, like, who was in pain and who wasn’t in pain on the basis of just the x-ray image, it also explained an enormous amount of that gap in pain between Black and white patients. So, the algorithm was doing a better job explaining pain for everybody, but it was doing a particularly good job at explaining the pain that was in Black patients’ knees and causing that increase in pain relative to white patients. So, it was finding things that human radiologists were missing that were causes of real, genuine, organic knee pain. Not depression, not stress, not anything else about the patient’s environment or the medical system. It was new things in the knee that radiologists don’t know about.

The information present in the x-ray — features of the knee image that the machine identified but that radiologists didn’t — meant that the higher reported rate of knee pain in Black patients was partly due to pathologic changes in the knee that just went unnoticed by the radiologist’s eye. In other words, the pain was real. It wasn’t just in someone’s head, or a result of things like stress. This finding has huge implications.

OBERMEYER: The reason it’s not at all implausible that this algorithm was discovering new medical knowledge about knee arthritis is because when you look back at the original studies of knee arthritis, like, how do we know anything about arthritis of the knee? Well, all those studies were done in the 1950s in England in coal miners who were all white and male, and definitely not representative of the kinds of patients that doctors are seeing in their offices today. And so, by learning from the experiences of patients, rather than just spitting back out what doctors already know, the algorithm had discovered new things about the knee and what is painful. And those things were particularly common in Black patients.

Ziad’s paper was published in Nature Medicine in 2021. As for those “new things in the knee” that their algorithm picked up? Well, they are working with a Ph.D. student at Stanford to pinpoint exactly what those are. And, I have to say, all this gives me a lot of hope — hope that technology can be part of the solution and not just part of the problem, that it can help doctors and other medical professionals overcome some of their flaws. Here’s Gurpreet again.

DHALIWAL: In the last couple of years, we’ve accumulated a couple of examples. They tend to be in visual areas or occasionally in auditory areas. So, there are programs now that have shown that they can predict whether a pigmented skin lesion is a benign mole or a skin cancer, or whether an x-ray shows lung cancer or not. So, there are now demonstrations where the computer has matched or outperformed the doctor primarily in research settings, but they are impressive findings.

A 2016 paper in JAMA found that an algorithm was really good at using photos to detect diabetic retinopathy — that’s a complication of diabetes that can cause poor vision or even blindness. How good was the algorithm? It had sensitivity rates upwards of 90 percent.
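Sensitivity, for reference, is just the share of true disease cases a test catches. A quick sketch with hypothetical confusion-matrix counts (these numbers are invented for illustration; they are not the JAMA paper’s actual results):

```python
# Sensitivity and specificity from confusion-matrix counts.
# The counts below are hypothetical, not from the JAMA study.

def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Share of diseased cases the screen correctly flags."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives: int, false_positives: int) -> float:
    """Share of healthy cases the screen correctly clears."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical screen of 1,000 retinal photos, 100 with retinopathy:
tp, fn = 92, 8     # diseased eyes caught vs. missed
tn, fp = 855, 45   # healthy eyes cleared vs. wrongly flagged

print(f"sensitivity: {sensitivity(tp, fn):.0%}")  # 92%
print(f"specificity: {specificity(tn, fp):.0%}")  # 95%
```

A sensitivity “upwards of 90 percent” means the algorithm missed fewer than roughly one in ten true cases of retinopathy in the study’s evaluation.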

So, is this something that we can really use today? There are so many questions that, I think, we need to consider first. For example: what are the legal issues around using A.I. to help doctors make diagnoses?

DHALIWAL: Let’s say my A.I. system is more accurate. Like, there’s a system in the office, and it’s shown to be statistically more accurate than the doctor at calling something lung cancer, for instance, or skin cancer. Then, if that system makes the wrong call and says, “Don’t worry, it’s not cancer, or it’s unlikely to be cancer,” and we learn later that unfortunately it was a cancer, who is responsible for that? I think that sort of thing is not sorted out yet.

It’s a bit like what’s happening right now with self-driving cars. Carmakers and car insurers are still trying to answer similar questions about who’s responsible if there’s a crash and an autonomous vehicle hurts someone. And also, if you have any hesitation at all about something like self-driving cars, how much would you, as a human, be willing to trust a diagnosis that came from a computer?

DHALIWAL: This gets at the issue in A.I. systems with what’s called the black box issue, which is that the more sophisticated the neural network gets that makes these predictions, the less the physician who might be receiving the output of this, or even perhaps the computer scientists who built it can understand why the computer came to the conclusion that it did. On balance, it’s statistically more correct. But you have to ask yourself how comfortable humans are when you can’t weave a story for why I think you have a disease, and I just say, “Well, the computer thinks you did.” I didn’t really have an appreciation of this because oftentimes, when we tell a patient a conclusion, you know, we are apt to say, “These are the reasons I think you have an infection or arthritis or a cancer or Covid infection.” You know, we sort of lay out the data, sometimes in a very brief way. But these very sophisticated ones, sometimes it’s just, “The computer says so.” And I’m not sure humans will accept that scenario.

So, how would Gurpreet like to use technology like A.I. in his everyday practice?

DHALIWAL: Anytime the computer could offload or at least assist in a decision, it would be terrific. I sometimes fantasize about a place where the computer is doing something smart that really assists me in a way that helps the patient. So, it’s not quite replacing what I have to do (I still need to make the decision), but all the inputs have been facilitated. Like, an example would be: what if the computer was smart enough so that when I’m talking to the patient, it was able to listen in, use something like natural language processing, and do a task like recording the medical note, which is rather clerical, and sort of typing into the computer, but also thinking in the background and say, you know, “I heard that this patient has an occupation in a quarry, and you’re talking to him about shortness of breath. Maybe we should think about lung diseases related to that.” Or, “I heard that you’re talking to the patient about shortness of breath. I’ve pulled up the most recent ultrasound of the heart, an echocardiogram, for you to look at,” so that I’m staying focused on the patient and connecting with them and informing them. And the computer is doing a lot of that heavy lifting in the background, but the decision-making still sits with me.

One analogy that I really like is thinking of a golf caddy. I do play a mean mini golf, but I’ve never played golf. I understand, though, that when people do play golf, if they’re lucky enough, they have a really good caddy. What the caddy does is sort of size up the situation and say, “This is the club that you need in this moment just to get to the next hole.” They hand it over, and then the professional, the golfer, gets to stay in the flow of the moment. So, I think what we’re looking for is a good caddy for every step along the way in the diagnostic process.

I want to end today’s episode with a little reflection on what the future holds. I don’t think that machines will replace doctors. Being a doctor is about more than making a diagnosis. It’s about relationships and trust. And that’ll always be true.

But it’s impossible to think that machines won’t give doctors a run for their money. It may not be in the next ten years, but just think about how different our lives are today — both in technology and medicine — compared to thirty years ago. If I had to bet, I think that when my kids are my age today, computers will be able to identify a lot of diseases as or more accurately than doctors — and faster.

Part of getting to that point will be the democratization of data. Right now, healthcare data are siloed and hard to access. That’s why Ziad helped launch Nightingale Open Science. It’s a nonprofit that houses health data focused on medical images that researchers can use. Things like x-rays and electrocardiograms.

We also need to solve a completely different problem: human capital. Health care has to compete with Silicon Valley for the talent needed to really build out this technology well. Why work for a big hospital system trying to identify sepsis earlier in patients when you could work for a flashy company like Google, which might pay you more?

And, in the end, whatever we do come up with, we need to hold it to the same standards of evaluation as all other medical care. That means randomized trials that show that A.I. improves patient outcomes compared to relying on physicians alone, something Ziad and his colleagues are doing right now.

OBERMEYER: We can’t just be, you know, willy-nilly deploying algorithms and just crossing our fingers and hoping for the best. And so, a really important project that I’m working on now is translating that algorithm that we discussed, about testing for heart attack in the E.R., into a big, randomized trial at a health system with a bunch of different hospitals. We’re going to roll that algorithm out at some of those hospitals, to some of those doctors and not to others. And then we’re going to be able to compare those two groups so that we can quantify the impact of that algorithm on patients’ health. And I think that’s the standard that we need to hold not just algorithms, but any new technology in health to. You know, is it doing what we want it to be doing? Is it worth it? I think once we start seeing more and more not just algorithms developed, but trials that show that algorithms can make a huge difference for patient outcomes, that can improve doctor decision-making, that can lower costs while at the same time improving health, I think adoption is going to take off.

Thanks to Ziad and Gurpreet for sharing their knowledge and research with us today. By the way, Stephen Dubner has covered the automation revolution on Freakonomics Radio. You can find that episode — No. 461, “How to Stop Worrying and Love the Robot Apocalypse” — and links to the research I talked about today at You can also hear a great interview with Ziad’s co-author Sendhil Mullainathan, on the show People I (Mostly) Admire. That’s another show in the Freakonomics Radio Network. If you’d like to contact this show directly, send an email to So, many listeners have written in with great episode ideas and interesting questions — keep them coming! And don’t forget to leave a review on Apple Podcasts. It really helps us out. Thanks for listening!

*          *          *

Freakonomics, M.D. is part of the Freakonomics Radio Network, which also includes Freakonomics Radio, No Stupid Questions, and People I (Mostly) Admire. All our shows are produced by Stitcher and Renbud Radio. You can find us on Twitter and Instagram at @drbapupod. Original music composed by Luis Guerra. This episode was produced by Mary Diduch and mixed by Eleanor Osborne. The supervising producer was Tracey Samuelson. We had research assistance from Emma Tyrrell. Our staff also includes Alison Craiglow, Greg Rippin, Rebecca Lee Douglas, Morgan Levey, Zack Lapinski, Ryan Kelley, Jasmin Klinger, Lyric Bowditch, Jacob Clemente, Alina Kulman, and Stephen Dubner. As always, thanks for listening.

BAPU: Hey, Siri, am I having a heart attack?

SIRI: I couldn’t say.

BAPU: Siri, is a heart attack good or bad? … Siri?
