In a nutshell: Could you tell if a voice is AI-generated, even when you know that what you're listening to could be deepfaked audio? It might not be as easy as you think. A new study shows that humans can detect artificially generated speech only 73% of the time, in both English and Mandarin.
The study was performed by Kimberly Mai at University College London and her colleagues, who used a text-to-speech algorithm trained on two publicly available datasets. Fifty deepfake speech samples were generated in each language to test whether listeners could distinguish the fakes from real voices.
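To give a rough sense of what producing such samples involves, here is a minimal sketch using the open-source Coqui TTS library. The model name and sentences are illustrative stand-ins; this is not the study's actual pipeline, models, or datasets.

```python
# pip install TTS  (the open-source Coqui TTS package; an assumption,
# not necessarily the toolchain Mai and colleagues used)
from TTS.api import TTS

# A publicly available English model trained on the LJSpeech dataset;
# the study's exact models and training data may differ.
tts = TTS(model_name="tts_models/en/ljspeech/vits")

# Generic placeholder sentences standing in for the study's stimuli.
sentences = [
    "The weather forecast predicts light rain this afternoon.",
    "Please remember to collect your ticket at the front desk.",
]
for i, text in enumerate(sentences):
    tts.tts_to_file(text=text, file_path=f"deepfake_{i:02d}.wav")
```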
A total of 529 people took part in the study. They listened to a real female speaker reading generic sentences in English or Mandarin, interspersed with AI-generated phrases.
The first group listened to 20 voice samples in their native language and had to decide if they were real or fake. Participants picked the right option 73% of the time.
A second group heard 20 randomly chosen pairs of audio clips, one spoken by a human and one by an AI. People were able to spot the deepfake 85% of the time in this test, though having two versions of the same clip to compare side by side inevitably made the challenge easier.
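That gap is roughly what signal detection theory predicts when the same listener sensitivity is applied to a single-clip judgment versus a forced choice between a pair. A quick back-of-the-envelope calculation (an illustration, not the study's own analysis):

```python
from scipy.stats import norm

# Standard signal-detection relations for an unbiased listener:
#   single-clip (yes/no) accuracy = Phi(d'/2)
#   paired (2AFC) accuracy        = Phi(d'/sqrt(2))
d_prime = 2 * norm.ppf(0.73)           # sensitivity implied by 73% single-clip accuracy
paired = norm.cdf(d_prime / 2 ** 0.5)  # predicted accuracy when comparing a pair
print(f"d' = {d_prime:.2f}, predicted paired accuracy = {paired:.0%}")  # ~81%, vs 85% observed
```

The predicted 81% lands close to the observed 85%, suggesting the side-by-side comparison itself, rather than sharper listening, does much of the work.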
Deepfaked voices often come with telltale signs, such as stilted, monotone speech or unnatural stuttering, that mark them as artificial. Even so, accuracy improved only slightly after participants were trained to recognize the characteristics of AI-generated speech.
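As a concrete example of one such sign, a simple heuristic (hypothetical, not taken from the study) could flag monotone delivery by measuring how little the fundamental frequency varies across a clip, here with the librosa library:

```python
import librosa
import numpy as np

# Hypothetical monotony check: a flat pitch contour is one of the
# stilted qualities listeners associate with synthetic speech.
y, sr = librosa.load("clip.wav", sr=None)  # "clip.wav" is a placeholder path
f0, voiced, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
f0_std = np.nanstd(f0[voiced])  # pitch variation over voiced frames
print(f"F0 standard deviation: {f0_std:.1f} Hz")
# Unusually low variation can hint at monotone, possibly synthetic, delivery.
```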
"In our study, we showed that training people to detect deepfakes is not necessarily a reliable way to help them to get better at it. Unfortunately, our experiments also show that at the moment automated detectors are not reliable either," said Mai.
"They're really good at detecting deepfakes if they've seen similar examples during their training phase, if the speaker is the same or the clips are recorded in a similar audio environment, for example. But they're not reliable when there are changes in the test audio conditions, such as if there's a different speaker."
It's important to note that the people in the study knew they were listening out for an AI-generated voice; a person not expecting one would likely find it harder to recognize a fake. There have been cases of scammers using cloned voices to call people and trick them into believing they are talking to family, friends, or officials, and into handing over sensitive data. There's also concern over security systems that rely on voice identification.
Mai also noted that the algorithms used to create the deepfakes in the study are relatively old, so deepfakes generated with newer and future technologies will sound even more like the real thing, with less of an uncanny valley effect.
Microsoft have announced their AI "VALL-E"

Using a 3-second sample of human speech, it can generate super-high-quality text-to-speech audio from the same voice. Even emotional range and acoustic environment of the sample data can be reproduced. Here are some examples. pic.twitter.com/ExoS2VWO6d

– Del Walker (@TheCartelDel) January 7, 2023
Back in January, Microsoft researchers announced VALL-E, a new AI that can accurately mimic a human voice from a mere three-second audio sample.