Manual Detection of Fake Audio

From Antispoofing Wiki

Definition and Problem Overview

Fake audio, in most cases, refers to falsified audio data that impersonates the voice of a legitimate speaker. Although it is not always produced with ill intent, synthetic audio made with voice-cloning tools is often used for fraud and misinformation.

A rudimentary voice-imitating technology appeared in the late 18th century, when Wolfgang von Kempelen introduced his 'speaking machine', a contraption that consisted of bellows, a wind chest, a mouth funnel, and other parts. The machine simulated the human vocal tract and could be played like a musical instrument, producing complete sentences in several languages.

In 1976 the Kurzweil Reading Machine, an early text-to-speech (TTS) device that read printed text aloud, was introduced. An early attempt at realistic voice imitation was made by the company CereProc, which cloned George Bush's voice for entertainment and also restored the voice of Roger Ebert, who had lost it to illness. WaveNet, an early neural network for speech generation developed by Google's DeepMind, followed in 2016.

The advent of machine learning, notably Generative Adversarial Networks (GANs), which debuted in 2014, took voice cloning to the next phase. The new architecture made it possible to reproduce a person's voice while picking up highly individual and subtle prosody features such as intonation and speech rhythm.

Voice deepfakes, due to their realism, are considered a serious threat to both Automatic Speaker Verification (ASV) systems and human listeners. In 2020, a voice spoofing attack helped culprits inflict multimillion-dollar losses on a U.A.E.-based company.

Theoretical Basis

Human vocalization is a complex process with numerous components, among them respiration, phonation, and articulation. Speech is produced by rearranging the structures of the vocal tract (tongue, lips, vocal folds) while a stream of air is pushed up from the lungs through the larynx, where the arytenoid cartilages are activated.

Therefore, understanding how the human vocal tract works, including its physiological limitations and the way it forms phonemes, is essential to synthetically reproducing a genuine human voice. One approach suggests that the same knowledge helps to tell fake voice samples from real ones.

This approach is based on a technique used in paleoacoustics, a scientific field that studies the vocalization of paleofauna based on fossilized vocal organs. By analyzing their anatomy, paleontologists can reconstruct the approximate sounds that ancient animals produced. Recreated examples include Dryptosaurus and Utahraptor.

The concept implies that the vocal tract of a speaker can be reconstructed from the voice samples in a recording. While a human tract has a broader and somewhat uneven shape, the computer's 'tract' turns out unnaturally thin and regular.
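As a rough illustration of reconstructing a tract shape from audio, the sketch below uses the classical lossless-tube model from linear prediction: reflection coefficients estimated from a speech frame are turned into relative cross-sectional areas of consecutive tube sections. This is a textbook stand-in chosen for illustration, not necessarily the method used in the research summarized above. The function names, the model order, and the `area_spread` measure are all assumptions, and no decision threshold is implied.

```python
def autocorrelation(signal, max_lag):
    """Biased, unnormalised autocorrelation for lags 0..max_lag."""
    n = len(signal)
    return [sum(signal[i] * signal[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def reflection_coefficients(r, order):
    """Levinson-Durbin recursion; returns k_1..k_order for the given autocorrelation."""
    a = [1.0] + [0.0] * order          # prediction-error filter coefficients
    err = r[0]                         # prediction error energy
    ks = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        ks.append(k)
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return ks


def tube_areas(ks, first_area=1.0):
    """Relative areas of successive tube sections.

    Assumption: area ratio A[i+1] / A[i] = (1 - k) / (1 + k). Textbooks differ
    on the sign of k, which flips each ratio; only relative shape is meaningful.
    """
    areas = [first_area]
    for k in ks:
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas


def area_spread(areas):
    """Hypothetical summary measure: widest section divided by narrowest."""
    return max(areas) / min(areas)
```

In use, a short voiced frame would be passed through `autocorrelation`, `reflection_coefficients`, and `tube_areas`, and the resulting profiles of known-genuine and suspect recordings compared. How much the shapes must differ before a sample is called fake is not something this sketch settles.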

Notable Examples of Audio Deepfakes

Audio deepfakes can be separated into two categories: non-malicious uses, such as entertainment, and spoofing attacks.


Audio deepfake technology can be applied in various fields, including film production, game design, automated customer assistance, and healthcare. A vivid example is online entertainment, where GAN architectures simulate the voices of video game characters, politicians, and music artists.

The YouTube channel Vocal Synthesis features, among other things, a series of videos in which several American presidents recite the same Navy Seals Copypasta, a piece of web folklore satirizing online threats. A similar video features a group of American presidents performing a cover version of an N.W.A song.

A fan-made Witcher 3 mod also received attention online. Dubbed A Night to Remember, it featured the voice of the lead character's voice actor, Doug Cockle, synthesized with CyberVoice, which is now known as SteosVoice.

The mod, perhaps for the first time, drew attention to the copyright status of an actor's voice as their intellectual property. Voice cloning of game characters has also become popular regionally, with Warcraft characters rapping in Russian.

The YouTube channel Corridor Crew hosted an experiment with voice hijacking, which was used to create an 'inspirational video' about the culture of various States. A bogus interview between Joe Rogan and Steve Jobs was also created to commemorate the legacy of Apple's founder. During the interview, both synthetic interlocutors agreed that computer technology is a "double-edged sword" and promised to get rid of their computers.

In 2018 CereProc recreated the speech that John F. Kennedy was due to deliver in Dallas on November 22, 1963, the day he was assassinated. The project was challenging, as the sample material of 831 speeches varied in quality. To overcome this, the authors used de-noising, prosody modelling, and speech segment blending. Kennedy's voice was recreated from 116,777 small phonetic units in total.

Spoofing attacks

At least two cases of audio deepfake attacks are known. The first took place in 2019, when the CEO of an energy company received a call that appeared to come from the chief executive of its German parent company. The culprits had synthesized the German chief executive's voice and successfully requested a $243,000 transfer.

A similar spoofing scenario targeted a company from the U.A.E., although the culprits contacted its Hong Kong-based partner bank. The attack was carefully planned: the culprits knew about an upcoming acquisition deal and prepared emails that appeared to come from the coordinating lawyers. As a result, $35 million was stolen.

According to Pindrop's CEO, fraud, including phone and voice-based scams, causes $470 million in annual losses.

The Main Ways of Fake Audio Detection

Voice anti-spoofing relies on liveness detection, both active and passive. Some cues can also be checked manually, without special tools.


Active liveness detection

This approach requires the person requesting access to say a passphrase, which then undergoes analysis. One of the techniques is audio segmentation, which applies time mapping between the audio stream and the uttered phrase to assess genuineness. Another method is Audio-Visual Automatic Speech Recognition (AVASR). Using a facial point distribution model (PDM) and lip movement tracking, it can spot audiovisual inconsistencies.


Passive liveness detection

A promising solution called IDLive Voice can detect spectral artifacts left by speech conversion and TTS generation, which the human ear cannot catch. Void, a voice liveness detection system, examines the cumulative power patterns of acoustic spectrograms, which helps detect replayed audio. The POCO method analyzes the pop noise intrinsic to real audio, a phenomenon that occurs when people pronounce plosive consonants.
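To make the idea of a cumulative power pattern concrete, the sketch below computes how spectral power accumulates from low to high frequencies within one short frame. It is a simplified illustration and not the Void feature set: the function names and the `low_band_share` measure are assumptions, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def power_spectrum(frame):
    """Power of DFT bins 0..N/2, computed naively (use an FFT for real workloads)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def cumulative_power(spectrum):
    """Running share of total power, from the lowest bin upwards (ends at 1.0)."""
    total = sum(spectrum)
    if total == 0:
        return [0.0] * len(spectrum)
    curve = []
    running = 0.0
    for power in spectrum:
        running += power
        curve.append(running / total)
    return curve


def low_band_share(frame, sample_rate, cutoff_hz):
    """Hypothetical feature: share of power at or below cutoff_hz."""
    curve = cumulative_power(power_spectrum(frame))
    bin_hz = sample_rate / len(frame)            # frequency width of one bin
    last_bin = min(int(cutoff_hz / bin_hz), len(curve) - 1)
    return curve[last_bin]
```

The shape of the cumulative curve, rather than any single value, is what would be compared between live and replayed recordings. A real detector would derive several features from it and feed them to a trained classifier.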


Manual detection

It is suggested that a person can detect a voice spoofing attack without special tools. It is important to pay attention to fricatives such as /s/, /z/, and /f/, which TTS tends to render close to a hissing noise. Other cues include odd phrasal timing with unnaturally long or short pauses, strangely gliding or monotonous intonation, distortion, and frequent misplaced syllable stress.
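One of these cues, unnaturally long or short pauses, can also be measured rather than judged by ear. The sketch below is a minimal illustration that finds silent stretches from frame energy and flags those outside a plausible range. The function names and all thresholds (`silence_rms`, `min_s`, `max_s`) are placeholder assumptions that would need tuning on real recordings.

```python
import math


def frame_rms(samples, frame_len):
    """Root-mean-square level of consecutive, non-overlapping frames."""
    return [math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def pause_durations(samples, sample_rate, frame_ms=20, silence_rms=0.01):
    """Durations in seconds of every run of frames quieter than silence_rms.

    Leading and trailing silence is counted as well; trim it first if only
    pauses between words are of interest.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    durations = []
    run = 0
    for level in frame_rms(samples, frame_len):
        if level < silence_rms:
            run += 1
        elif run:
            durations.append(run * frame_len / sample_rate)
            run = 0
    if run:
        durations.append(run * frame_len / sample_rate)
    return durations


def suspicious_pauses(durations, min_s=0.05, max_s=1.5):
    """Pauses shorter than min_s or longer than max_s (assumed limits)."""
    return [d for d in durations if d < min_s or d > max_s]
```

A flagged pause is only a hint. Genuine speakers also hesitate or rush, so a check like this is best combined with the other cues listed above.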

Positive Effects of Fake Audio

As in Roger Ebert's case, speech synthesis can restore a voice that a person has lost to an illness such as rheumatoid arthritis or vocal cord paralysis. A synthetic voice based on previously captured audio samples of a subject can be integrated into a neurally controlled speaking device that is capable of decoding "novel sentences from participants' brain activity".

