Episode Transcript

Katie Ryan’s home office in Pittsburgh, Pennsylvania, is pretty run-of-the-mill.

RYAN: I just have a regular Ikea desk. I have a big TV up on the wall. I have a laptop stand with my laptop on it, and then I have a monitor stand that has two monitors on it. There’s a blanket on the floor for my dog, you know.

But the work she does at this desk is seen by millions of people every week.

RYAN: I’ve done the Super Bowl a handful of times, I’ve done the Olympics many times. I just did the Oscars a couple weeks ago. Any sporting event that you can think of, I’ve probably done it. Any major news event that has happened, I have probably been involved in that somehow. Presidential funerals. Presidential debates. I remember when the Boston Marathon bombing happened, the breaking news was just constant. I think I was on the air writing without a commercial break for something like three and a half hours.

Ryan is a captioner. She writes the text transcripts that appear on your TV screen when you turn on closed captioning. She does this in real time.

RYAN: Most people think their TV just does it. They don’t realize that there’s a person like me sitting in a room with headphones on. And people don’t realize that it’s happening live. Like, if I’m writing a news broadcast or a sporting event, maybe I have, like, five seconds extra than you do when you’re hearing it. And I have to write it at the same time, and try and keep up with all the speedy talkers that are out there.

In some ways, it’s a good business to be in. One survey found that 50 percent of Americans — and 70 percent of Gen Z viewers — say they watch content with captions on most of the time. But the industry is also rapidly changing. The nimble fingers of human captioners like Katie Ryan are up against the neural networks of artificial intelligence services.

KARLOVITS: Technology is the key to the future of captioning. But you know, you need people that are looking at the content.

For the Freakonomics Radio Network, this is The Economics of Everyday Things. I’m Zachary Crockett. Today: Closed Captions.

 *      *      *

The term “captions” is often used interchangeably with “subtitles,” but the two are different. Subtitles are used for translation. Captions are designed for people with hearing impairments, and they describe every auditory element — dialogue, sound effects, music, and, sometimes, even background noises.

KARLOVITS: The goal of captioning is to give the user the content of exactly what’s being heard.

That’s Doug Karlovits. He’s a general manager at Verbit, the largest provider of captions in America. He says that, if you’re watching something on TV — either live or pre-recorded — you can almost always turn on the captions in the device’s settings. But that wasn’t always an option.

KARLOVITS: Really, captions were born for television in 1970. The first prerecorded show ever captioned was “The French Chef” with Julia Child.

The earliest efforts were called open captions, and they were limited to pre-recorded shows. The text was a permanent part of the video. Eventually, a new method called closed captions made it possible for viewers to turn the text on and off. And by the 1980s, thanks to the efforts of the non-profit National Captioning Institute, captions could also be used for live television. Around this time, Karlovits’s father, Joe, saw an opportunity to expand the captioning industry.

KARLOVITS: My father was a court reporter, a stenographer. And he became very interested in computers, and how to take his stenotype and get it translated through a computer into English.

Stenographers are extremely fast typists. On stenotype machines, they can transcribe up to 300 words per minute. Joe began training fellow stenographers to do TV captioning. And in 1986, he founded a company called VITAC, which was later acquired by another company called Verbit.

KARLOVITS: We started out with a local television station in Pittsburgh and eventually grew into the largest provider in North America of captioning.

Today, broadcasters, cable companies, and satellite services are required by federal law to have captions available for nearly every televised program. This also carries over to much of the media on streaming services online, and most video content in public settings, like courtrooms, hospitals, schools, and sports bars.

Captions have to be readable, accurate, and inclusive of all audio context. They have to clearly identify each speaker, and, for live broadcasts like news programs, they appear almost in real time.

KARLOVITS: In the United States, everything that airs on television should have captions today. Almost every show has captions on it.

VITAC is one of three companies, alongside IBM and ZOO Digital Group, that control around 60 percent of the captioning market. Karlovits says they caption around 500,000 hours of content a year.

KARLOVITS: We work with all the major broadcasters, all the various producers of television programs. Work with all the different universities around the world providing captions for the classroom. On the legal side, we’re working with law firms and court reporting agencies. And on the government side we’ll do anything from town halls to training on all the different things. We also work with sports venues, theaters. So everywhere where words are spoken, there’s the opportunity to add captions.

Much of today’s captioning has shifted from human stenographers to automated tools. In some cases, the captioning service uses a technique called respeaking: A human employee watches a show in a recording booth and carefully recites every word into a special microphone. Voice-to-text software turns their words into a written transcript. In other cases — particularly with pre-recorded TV shows — technology can be used to generate text from a script. But for live TV — like news broadcasts, Super Bowls, and presidential debates — a human captioner clacking away at a machine is still the most reliable option.

A stenographer gets a live feed of a network’s audio a few seconds before it goes to the general public. They listen through a pair of headphones while typing out the words in shorthand on their stenotype machine. This shorthand goes through processing software on a computer that turns it into text. The text is embedded in a video signal that’s transmitted to the television network through modems and IP connections. And when you press the closed captions button on your remote, a microchip inside your TV retrieves and displays the captions on screen. It’s a complex process, and networks might pay Verbit anywhere from $130 to $175 per hour for live human captioning services.
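To make that chain concrete, here’s a minimal sketch, in Python, of the translation step at its center. The chord entries are placeholders invented for illustration (real computer-aided translation software ships with dictionaries of tens of thousands of entries), and the encoder is reduced to a simple string join; in U.S. broadcasting, the text would actually be packed into CEA-608/708 caption data for the TV’s decoder chip to read.

```python
# Toy model of the live path: strokes arrive from the stenotype, a
# dictionary lookup turns them into English, and the text is handed
# to an encoder for embedding in the video signal.
STENO_DICT = {
    "PWRAEBG": "breaking",   # placeholder chords, for illustration only
    "TPHAOUS": "news",
}

def translate(stroke: str) -> str:
    # A stroke with no dictionary entry falls through as raw steno,
    # which is why a missed entry can show up as gibberish on screen.
    return STENO_DICT.get(stroke, stroke)

def caption_line(strokes) -> str:
    """Translate a burst of strokes into one caption line."""
    return " ".join(translate(s) for s in strokes)

print(caption_line(["PWRAEBG", "TPHAOUS", "TK-PL"]))
# -> "breaking news TK-PL" (the last stroke has no dictionary entry)
```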

KARLOVITS: So if you have a broadcast show that’s in a 30-minute block, but it may be really only on the air for 24 minutes, they would pay for that on a per-minute basis. If you’re doing a live show, you’re paying basically for the times that are booked. Because you don’t know how long those live shows can go.

So, who are these humans who create the captions on TV? And what’s it like to be on the clock during a live broadcast?

RYAN: Sometimes you can’t even get a drink of water. 

That’s coming up.

 *      *      *

Katie Ryan didn’t start out hoping to be a professional captioner.

RYAN: When I was graduating high school, I really didn’t know what I wanted to do with my life. And my Great Aunt Sandy, her sister at the time was an official court reporter in Philadelphia. And Sandy said, “Well, you can type fast on a keyboard. Why don’t you look into stenography?”

Ryan completed a court reporting program at a community college in Pittsburgh and joined VITAC, now Verbit, after graduating. She’s been at the company as a captioner for more than two decades.

In her work, Ryan uses a machine called a stenotype. It has a small screen and around 20 unmarked keys that look kind of like popsicle sticks. She’s able to type at speeds of up to 300 strokes per minute using a technique called chording: she presses down on multiple keys simultaneously to phonetically spell out whole syllables, words, and phrases with one motion.

RYAN: Stenography is essentially learning another language. It’s combinations of keys to make words. And so, on the machine, each key has a letter, and then there are combinations of keys that make more letters. P-B would be N. The letter I would be E-U. The letter D would be T-K. And then there are combinations of keys that make words. So “and” would be A-P-B-D. Your hands are on different sides of the keyboard on the machine. Your left hand is prefixes, your right hand is suffixes. And then you have your endings — “ING,” “S,” “ED” — on your right side.

Ryan can spell out entire phrases with just a few keystrokes.

RYAN: A good example would be, like, “Ladies and gentlemen” — that would be good for TV or court. On my machine it would be L-A-I-R-J. So you hit all of those keys at once, and “ladies and gentlemen” will come out in your computer software.

CROCKETT: In one fell swoop.

RYAN: In one stroke, you get all those words.

Before she goes live, Ryan creates a dictionary full of customized briefs — abbreviations of specific words that she knows will recur throughout a broadcast. For the Academy Awards, she’ll program combinations of keystrokes for the title of each nominated movie. For a hockey game, she’ll program every player’s name.

RYAN: Instead of having to write out their name every single time that it’s said, you hit that one combination of keys once or twice, and then that whole name will come out. Obviously, we have to search ahead of time to find out who, like, your play-by-play announcer is and who your color analyst is.
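Ryan’s examples translate directly into a lookup table. Here’s a minimal sketch, again in Python, built from the two chords she describes; the announcer brief at the end is hypothetical, added only to show the pre-broadcast step.

```python
# Chords taken from Ryan's examples above.
steno_dict = {
    "APBD": "and",                     # A-P-B-D, one stroke
    "LAIRJ": "ladies and gentlemen",   # four words in a single stroke
}

# Before going live, merge in event-specific briefs, e.g. every
# player's name for a hockey game. This entry is invented.
steno_dict.update({"KROBGT": "Zachary Crockett"})

strokes = ["LAIRJ", "APBD", "KROBGT"]
print(" ".join(steno_dict.get(s, s) for s in strokes))
# -> "ladies and gentlemen and Zachary Crockett"
```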

But the process doesn’t usually go without a hitch or two. Captioners are human, after all, and they make the occasional mistake. While there is no federally mandated benchmark, the standard for accuracy in the industry is 99 percent — meaning one out of every 100 words might be misspelled or altogether butchered.

Oftentimes, a captioner is aware of a typo; they just don’t have the time to fix it during a high-speed live broadcast.

RYAN: We have the asterisk on my machine, which is the key in the very middle, that can erase a mistake, but nine times out of ten you are not going to catch it fast enough before it already goes out on the air. And then if you try and take it back, it’s just going to garble the captions up. So, it’s better to just — if you make a mistake, just ignore it and keep writing and move past it. And then the faster it moves off the screen, the faster people will forget about it.

Even after 21 years on the job, Ryan has a few recurring issues.

RYAN: I tend to drag my fingers, so sometimes I will catch extra letters when I’m trying to write certain words. I’ll miss keys too. Like, if my fingernails are too long, sometimes I can’t quite hit the keys right.

Sometimes, you might notice the captions pause for a few moments, or go blank. This is likely because the captioner fell off pace and is trying to catch up. This happens most often with news shows, where the banter can be lightning fast. Rachel Maddow, who hosts her own live show on MSNBC, has been clocked talking at up to 270 words per minute — a challenge for even the most seasoned captioner.

RYAN: If you need to just let a sentence go and then catch up again, that’s okay. When you start paraphrasing, though, then you take the risk of presenting the wrong information or turning it into something that they didn’t actually say. And that’s the last thing you want to do. You don’t want to put words in anybody’s mouth.

The goal is to provide a text equivalent of as much of the audio as possible. This can be particularly challenging when multiple people are speaking at once.

RYAN: A lot of times it’ll just be, you know, a couple of words and a dash, and then the next person will be a couple of words and a dash. Sometimes there’s nothing you can do. If they’re just screaming at each other there is nothing you can do. You know, once they figure it out, then you can keep going again.

Doug Karlovits, the general manager at Verbit, says certain TV shows pose more problems than others. Like “The Osbournes,” a reality show from the early 2000s that followed the aging and often incomprehensible rockstar Ozzy Osbourne and his family.

KARLOVITS: The debates around the office on what we thought he was saying on that show was — you know, it was good watercooler conversations. Well, first it was, “Is he just putting this on?” Eventually, as that show got renewed, you realize: no, that’s how Ozzy talks. It was really like, “I think he said this,” and then, you know, people would go and, you know, “Come over. Listen to this. What do you think he said?” And, you know, you would just sit there and, “I don’t know. I don’t know what he said. I don’t think he knows what he was saying.”

There are also elements that require interpretation — like how to caption a noise, or a non-verbal vocalization. Some networks and studios are particular. Disney reportedly has specific rules about how R2-D2’s mechanical noises should be captioned. Netflix is fond of using the phrase “wet squelching” to describe the sound of monsters in the show “Stranger Things.” For background noises in live captioning, Ryan uses a list of templatized descriptions.

RYAN: We call them parentheticals. So, like, bells tolling, applause, singing, chanting — things like that. You want to try and be descriptive, but also you don’t want to go overboard. 

All of this effort is to ensure that people who are deaf or hard-of-hearing have equal access to media. But captions have found a much broader audience. A 2022 survey by the language learning platform Preply found that half of all viewers now watch media with captions on most of the time. Some have speculated that’s at least partly to do with modern sound mixing, which alternates between loud sound effects and quiet dialogue.

KARLOVITS: “Game of Thrones,” there was so much background noise occurring in that show that a lot of people started using captions.

But the most frequent users of captions are now younger people — particularly Gen Z. And that has more to do with changes in the media landscape.

KARLOVITS: The younger viewers — they are watching it on their phones. They’re watching it on their iPads. They’re not necessarily listening, but they’re reading it as they’re in class or they’re at work and don’t want to call attention to themselves. 

Some publishers have estimated that up to 85 percent of the videos they post on Facebook are watched on mute. Many short-form videos on social media sites now have captions coded directly into the media file that can’t be turned on or off.

KARLOVITS: That’s because it’s keeping that person who’s looking — it’s keeping their attention longer. 

Some platforms, like YouTube, offer their own tools to creators that use speech recognition to generate captions automatically. Karlovits says artificial intelligence has already fundamentally changed the captioning business. Verbit offers automatic speech recognition and generative AI tools that are trained with diverse language models to pick up on speech patterns. Karlovits says these options cost much less than traditional transcription. But they still aren’t as accurate or precise as a human captioner.

RYAN: Maybe a deaf person is in an area where there are tornadoes, and they turn on their local news. We want those people to be able to have captioning that is as accurate and as clean as possible, so they know what to do and they can be safe. I will always advocate for a human captioner to be there to give the best service possible.

CROCKETT: When you watch TV, do you always use the captions?

RYAN: No. Never have captions in my house.

CROCKETT: Really?

RYAN: Never. No. I sit in front of a computer and deal with that all day. I don’t need to worry about it. I’m off the clock.

For The Economics of Everyday Things, I’m Zachary Crockett.

 *      *      *

This episode was produced by me and Sarah Lilley, and mixed by Jeremy Johnston. We had help from Daniel Moritz-Rabson. And thanks to our listeners Owen Roberts and David Kennett for suggesting this topic. If you have an idea for an episode, feel free to email us at everydaythings@freakonomics.com. Our inbox is always open. All right, until next week.

CROCKETT: What if you’re in the middle of a live broadcast and you just really have to pee?

RYAN: It used to be a lot harder for me when I was in the office to, like, be able to run to the bathroom during a break if I needed to. Now from my office to my bathroom is, like, ten steps. So I can make it.

Sources

  • Doug Karlovits, general manager at Verbit.
  • Katie Ryan, live steno captioner at Verbit.
