
Why AI-generated audio is so hard to detect

Dozens of tools and apps have sprung up to try to detect AI-created audio, but they are inherently flawed, experts told NBC News.
AI audio detection tools can’t keep up with the pace of AI innovation. Kelsea Petersen / NBC News

Fake and misleading content created by artificial intelligence has rapidly gone from a theoretical threat to a startling reality. The technology to produce a convincing audio recording of a person speaking is constantly getting better and has become widely available with a simple online search.

The mere existence of the technology, and the difficulty detecting content created by it, is already causing chaos.

In January, a robocall imitating President Joe Biden’s voice targeted Democratic voters in New Hampshire. Roger Stone recently used an AI-detection program in an attempt to distance himself from a recording that appeared to feature his voice. And after a recording surfaced in which a high school principal appeared to make racist comments, his union suggested that AI may have been used to create it. The district is still investigating.

While dozens of tools and products have popped up to try to detect AI-generated audio, those programs are inherently limited, experts told NBC News, and won’t provide a surefire way for anyone to quickly and reliably determine whether audio they hear is from a real person.

Deepfake detection systems work very differently from how human beings listen. They analyze audio samples for artifacts like missing frequencies that are often left behind when audio is programmatically generated. Often, they focus on particular aspects of speech, like how the speaker seems to breathe or how much the pitch of their voice goes up and down.
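One such artifact is missing high-frequency content: some speech synthesizers generate audio at low internal sample rates, leaving an unnaturally quiet upper band. The sketch below is a toy illustration of that idea, not any detector described in this article — it uses the energy of sample-to-sample differences as a crude stand-in for high-frequency content, so a band-limited signal scores low.

```python
import math
import random

def high_freq_proxy(samples):
    """Ratio of first-difference energy to total energy.

    Rapid sample-to-sample changes indicate high-frequency content;
    a signal missing its upper band scores low. This is a crude,
    illustrative stand-in for the spectral-artifact checks that
    real detection systems run.
    """
    total = sum(s * s for s in samples)
    if total == 0:
        return 0.0
    diff = sum((a - b) ** 2 for a, b in zip(samples[1:], samples))
    return diff / total

# Synthetic demo: broadband noise (energy across the whole band)
# vs. a pure 440 Hz tone (no high-frequency content).
random.seed(0)
sr = 16000
noise = [random.gauss(0, 1) for _ in range(sr)]
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]

print(high_freq_proxy(noise) > high_freq_proxy(tone))  # prints True
```

A real system would work on spectrograms of actual speech and combine many such features; this only shows the shape of the approach.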

Reality Defender, a prominent deepfake detection company, says that it uses AI to detect AI. Just as generative artificial intelligence works by training algorithms on massive amounts of real, existing data to produce realistic new media, Reality Defender’s employees feed its algorithm both authentic and AI-generated content. Ben Colman, the company’s CEO, said the company clearly labels which samples are real and which are fake, in the hope that the system learns to estimate how likely a given piece of media is to be AI-generated.

“We never say 100%,” Colman told NBC News. “Our highest probability is 99% because we never have the ground truth. So it’s fully probabilistic,” he said.

The vast range of human voices and languages makes that work difficult, Colman said.

“With voices, it’s a population distributed across regions and languages and dialects and age. So we’ve got to think about every single variable,” he said.

The parent company of NBC News, Comcast, is an investor in Reality Defender.

With such an untested and rapidly evolving industry, there are few benchmarks for measuring a deepfake detection tool’s reliability.

But software is an inherently limited way to detect deepfakes, said Patrick Traynor, a University of Florida professor who specializes in computer science and telephone networks.

Most detection programs are trained to identify existing deepfake algorithms, making them a step behind new innovations, he said.

“Machine learning is really good at telling you about something it’s seen before, but it’s not so good about reasoning about things it hasn’t seen,” Traynor said.

“There’s a lot of hype in this space, and I’m extremely skeptical. The problems are so difficult,” he said.

Neil Zhang, a machine learning researcher at the University of Rochester, said it’s difficult to assess how well specific detection tools in the space work given the lack of existing benchmarks, but that the options out there are “better than nothing.”

“There’s a huge disparity in funding between companies racing to make passable deepfakes versus those trying to detect them,” he said. “It’s hard to get funding for detection, very easy to get funding for large language models and generative AI.”

That’s also reflected in academic research, which moves so slowly that it can’t keep up with how quickly the AI industry evolves. Many tools for deepfake detection — especially in academic fields — rely on old data that doesn’t match the current crop of deepfake production tools, he said.

“These kinds of detection tools can achieve very good performance on certain datasets, but cannot perform that well in the real world,” Zhang said.

Biden’s sweeping executive order regulating AI aims to address the problem. It tasks the Commerce Department with issuing guidance for how American AI companies should “watermark” the media they produce, so that it’s easy to tell it isn’t authentic. But that guidance still isn’t public, and it remains to be seen how many tools will follow it.

That regulation, which has yet to take effect, is already behind the industry. There’s a glut of companies that offer text-to-speech services that mimic real voices for free or cheap.

“If you simply search AI-based fake speech, you will get tens of searches right away,” said Vandana Janeja, professor of information systems at the University of Maryland, Baltimore County. “It’s almost criminal that all of these things are out there without any guardrails.”

Hany Farid, a professor at the University of California, Berkeley, who specializes in digital forensics, analysis and misinformation, said that while software analysis can help, the best way to reliably identify deepfakes is a combination of expert analysis, reporting on the origins of the audio, and critical thinking about the context of a recording.

Even though many experts don’t see detection methods as reliable, there are still cues that humans can listen for to tell if an audio recording is synthetic. Current deepfakes rarely include a person taking a breath in between words, and they often unnaturally space out each word evenly, unlike the way that real people talk.

“We have to fall back on something simpler,” Farid said. “Who’s posting this? Is it reliable? Does this sound right to you, Joe Biden telling you not to vote? Taylor Swift telling you that she’s giving away cookware? Common sense goes a long way.”