Voice Antispoofing: Origin, Types and Preventive Techniques

From Antispoofing Wiki

Historical Premise

Voice spoofing attacks, also referred to as Automatic Speaker Verification (ASV) spoofing attacks, are a relatively new threat to biometric recognition systems. Voice recognition technology has been around for almost 70 years: the first voice recognition system, dubbed “Audrey”, was introduced in 1952 by Bell Laboratories. Voice spoofing, however, only became a major concern in 2013: the origins of modern voice antispoofing, its types, and preventive techniques can be traced back to the INTERSPEECH 2013 conference, where these issues were discussed in depth for the first time.

Today, ASV attacks have become considerably more common due to the widespread use of smart gadgets available to millions of users.



Apple’s voice assistant Siri, introduced in 2011, led to the emergence of numerous similar services, such as Google Voice Search and Cortana, that can understand human speech and execute commands. As a result, ASV attacks allow malicious actors to access personal data, credit card details, commercially valuable information, and even other devices and appliances controlled by voice assistants such as Amazon’s Alexa. Companies and governments are also vulnerable to ASV spoofing. In 2019, a UK-based energy company lost $243,000 after fraudsters placed a call that simulated the voice of the chief executive of its German head office.

Terminology

The following terminology is used when discussing voice antispoofing, its attack types, and preventive techniques:

  • ASV (Automatic speaker verification). A system that uses the biometric information in a person’s voice (timbre, articulation, intonation, lexical behavior) to verify their identity.
  • SSD (Spoof speech detection). A set of measures used to identify and prevent voice spoofing attacks.
  • SS (Speech synthesis). A popular type of attack that employs a computer-generated voice.
  • VC (Voice conversion). A spoofing attack that uses filters and other tools to make an impostor’s voice sound as close as possible to the voice of the targeted individual.
  • RA (Replay attack). An attack in which fraudsters use a pre-recorded sample of the victim’s voice.
  • Impersonation. An attack in which the attacker mimics the victim’s voice tonality, prosodic features, vocabulary, etc.
  • Physical access. A direct attack method in which the attacker interacts directly with the sensor of a biometric system (the microphone).
  • Logical access. An indirect attack method that targets vulnerabilities in the code or hardware of a biometric system.


Types of attacks

ASV attacks are considered the easiest to execute of all spoofing attacks. Compared to facial spoofing or fingerprint replication, which require effort, skill, and time to produce believable copies, creating or obtaining a voice sample is much easier, faster, and cheaper.

Researchers have identified four main types of voice spoofing attacks:

  • Speech Synthesis
  • Voice Conversion
  • Replay
  • Impersonation

Each attack relies on its own techniques and strategies, and can therefore be detected and countered with specific measures.

Speech synthesis

Speech synthesis is an ASV attack that has lately been gaining considerable attention from malicious actors. The core idea behind this attack is simple: attackers use a text-to-speech generator to produce the audio data they need (passwords, commands, requests, names, etc.). Most text-to-speech synthesizers sound robotic and cannot be manipulated into mimicking a specific person’s voice. However, the case is different for synthesizers powered by neural networks. A good example is CyberVoice, a sophisticated synthesizer that can pick up various subtleties of a human voice: its timbre, unique intonations, and even accents and speech impediments. CyberVoice became controversial in 2021, when it was used to create highly convincing voice acting for a game mod.

Another example of a highly believable synthesizer is Google Duplex. During the 2018 Google I/O event, it demonstrated the ability not only to imitate a person’s voice but also to carry on a conversation on its own.

Voice conversion

In essence, Voice Conversion (VC) is a method of taking an attacker’s natural voice and making it sound as similar as possible to the voice of a targeted person. Earlier approaches relied on parallel voice conversion, which compares acoustic parameters of the source and target voices and uses tools such as dynamic time warping and phone-level forced alignment. Currently, the primary way of performing VC is deep learning, using variational auto-encoders or sequence-to-sequence neural networks. For example, in 2018 the WaveNet neural network was used for swapping voice content: while retaining the originally uttered text, it converted the source voice into a new one.
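To illustrate the parallel, alignment-based approach, below is a minimal sketch of dynamic time warping between two sequences of acoustic feature frames (for example, MFCCs). The NumPy implementation, feature dimensions, and random example data are illustrative assumptions, not part of any specific VC pipeline.

```python
import numpy as np

def dtw_align(source, target):
    """Dynamic time warping between two feature sequences.

    source: (n, d) array of source-speaker frames (e.g. MFCCs)
    target: (m, d) array of target-speaker frames
    Returns the list of aligned (source_index, target_index) frame pairs.
    """
    n, m = len(source), len(target)
    # Frame-to-frame Euclidean distances
    dist = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=-1)

    # Accumulated cost matrix with an extra row/column of infinities
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1],  # match
                cost[i - 1, j],      # step in source only
                cost[i, j - 1],      # step in target only
            )

    # Backtrack from the end to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Hypothetical usage: align 120 source frames with 150 target frames of 13 MFCCs each
src, tgt = np.random.randn(120, 13), np.random.randn(150, 13)
pairs = dtw_align(src, tgt)
```

The aligned frame pairs obtained this way are what a parallel VC system typically uses as training data for a mapping from source to target acoustic features.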

Replay attacks

The origins of voice antispoofing techniques were largely shaped by replay attacks, which are considered the simplest, most effective, and most widely used type of ASV attack. Malicious actors collect audio samples of a targeted person’s speech and later present them to the voice recognition system, that is, replay them. One of the best-known examples of replay attacks is the avalanche of spam calls that flooded the US and many other countries. During such a call, an impostor would trick the target into saying “Yes” or “No”, pronouncing their full name, and producing other short speech samples. Later, coupled with other data such as credit card details or bank information, these recordings would be used to steal money from the target’s bank account, request a loan, or commit other forms of fraud. As a side threat, longer speech samples obtained through trickery (for example, a target voicing an opinion on a certain topic) can be used to train a neural network to mimic their voice.

Impersonation

Impersonation is a method of ASV attack that often involves professional impersonators. Using this artistic skill, they can emulate characteristics of a victim’s voice, such as formant frequencies, to utter the necessary phrases.



Studies indicate that impersonation is not a highly effective means of attack: it leads to higher error rates, even when performed by professional and experienced impersonators. The overall threat posed by impersonation, however, is still a subject of discussion.

Mitigating Voice Spoofing

Countermeasures to ASV spoofing attacks, as proposed by security experts, can be categorized into two groups: Active liveness detection and Passive liveness detection.

Active liveness detection

Active methods require a person to perform a certain action for the system to identify them successfully. In the case of voice antispoofing, the user is required to pronounce a passphrase that is randomly generated from a predefined list of words (a dictionary).
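As a rough illustration of such a challenge, the sketch below generates a random passphrase from a small word list. The dictionary contents and word count are hypothetical; a real system would use a much larger vocabulary and tie the challenge to the verification session.

```python
import secrets

# Hypothetical dictionary; a production system would use a far larger word list
DICTIONARY = [
    "river", "orange", "seven", "window", "planet",
    "guitar", "silver", "ladder", "pepper", "rocket",
]

def generate_passphrase(num_words: int = 4) -> str:
    """Pick an unpredictable sequence of words for the user to utter."""
    return " ".join(secrets.choice(DICTIONARY) for _ in range(num_words))

print(generate_passphrase())  # e.g. "silver rocket river seven"
```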

Active methods include:

  • Audio segmentation. This method maps the audio stream onto the pronounced phrase in time, typically with the help of Hidden Markov Models (HMMs) and the Viterbi algorithm (see the sketch after this list). As a result, audio segmentation helps to assess the “correctness” of the uttered phrase and to check whether it was really pronounced by the person in question.
  • Audiovisual recognition. Audio-Visual Automatic Speech Recognition (AVASR) tracks changes in lip movement to check whether they match the spoken words. It uses an anthropometric face detector together with a facial Point Distribution Model (PDM).
  • Feature extraction. An audiovisual method that analyzes the lengths of vectors connecting pairs of facial landmark points. The visual data is then compared with the target’s speech to detect possible fakes.
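To make the segmentation step concrete, below is a minimal sketch of the Viterbi algorithm over per-frame state scores. The emission scores are assumed to come from an acoustic model of the expected passphrase; the NumPy implementation and the toy example are illustrative, not taken from any particular ASV product.

```python
import numpy as np

def viterbi(log_emissions, log_trans, log_init):
    """Most likely HMM state path for an observation sequence.

    log_emissions: (T, S) log-likelihood of each frame under each state
    log_trans:     (S, S) log transition probabilities (row = previous state)
    log_init:      (S,)   log initial state probabilities
    Returns (best_path, best_log_prob).
    """
    T, S = log_emissions.shape
    delta = np.full((T, S), -np.inf)       # best score ending in each state
    backptr = np.zeros((T, S), dtype=int)  # best previous state per frame
    delta[0] = log_init + log_emissions[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans          # (previous, current)
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = scores[backptr[t], np.arange(S)] + log_emissions[t]
    # Trace the best path backwards from the final frame
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))

# Hypothetical toy example: 3 states, 50 frames of random scores
path, score = viterbi(np.log(np.random.rand(50, 3)),
                      np.log(np.full((3, 3), 1 / 3)),
                      np.log(np.full(3, 1 / 3)))
```

Comparing the decoded path and its score against the alignment expected for the prompted phrase is one simple way to judge whether that phrase was actually spoken.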


Passive liveness detection

Passive liveness detection is performed at the system backend and does not require the user to perform any specific actions. In this way, it provides a more seamless and convenient user experience, which makes it the preferred option for recognition systems. Passive liveness detection focuses on analyzing the speech spectrogram itself, omitting any visual analysis of the speaker.

Methods include:

  • POCO method. POCO stands for Pop Noise Corpus. This method analyzes the so-called “pop noise” to provide liveness detection. Pop noise, in simple words, is a group of sonic artifacts produced by a speaker when they pronounce words with plosive consonants such as [p], [b], [k], and [g], or simply draw a breath. This artifact appears only when a person speaks directly into the microphone, so speech samples obtained by secretly recording a person’s voice do not contain it. Fake and stolen speech samples can therefore be detected by the absence of pop noise (a rough illustration is sketched after this list).
  • Void. Voice Liveness Detection (Void) analyzes cumulative power patterns in acoustic spectrograms, which allows the system to determine whether the voice has been replayed. For instance, Void can detect the natural distortion added by the loudspeakers used to play back a replayed voice sample.
  • Deep learning. This method combines Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs): the CNN is responsible for feature extraction, while the RNN models long-term dependencies (a minimal model sketch is given below). The combination yields a satisfying voice spoofing detection rate.
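As a rough illustration of the pop-noise idea from the first item above, the sketch below flags frames whose spectral energy is concentrated at very low frequencies. The 100 Hz cutoff and the 0.5 energy-ratio threshold are illustrative assumptions, not values taken from the POCO corpus paper.

```python
import numpy as np
from scipy.signal import stft

def pop_noise_frames(audio, sample_rate, cutoff_hz=100.0, ratio_threshold=0.5):
    """Flag STFT frames dominated by very low-frequency energy (pop noise).

    audio: 1-D waveform; sample_rate in Hz.
    Returns a boolean array with one entry per frame.
    """
    freqs, _, spec = stft(audio, fs=sample_rate, nperseg=1024)
    power = np.abs(spec) ** 2                    # (freq_bins, frames)
    low = power[freqs <= cutoff_hz].sum(axis=0)  # energy below the cutoff
    total = power.sum(axis=0) + 1e-12
    return (low / total) > ratio_threshold

# Hypothetical usage with a 16 kHz recording loaded as a NumPy array `wave`:
# mask = pop_noise_frames(wave, 16000)
# genuine_speech_is_plausible = mask.any()
```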
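For the deep-learning item, here is a minimal PyTorch sketch of a CNN-plus-RNN classifier over mel spectrograms. The layer sizes, input shape, and two-class output are illustrative assumptions rather than the architecture of any specific countermeasure system.

```python
import torch
import torch.nn as nn

class SpoofDetector(nn.Module):
    """CNN front end for spectrogram features, GRU for long-term context."""

    def __init__(self, n_mels: int = 80, hidden: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                        # pool over frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.GRU(64 * (n_mels // 4), hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)                 # genuine vs. spoofed

    def forward(self, spec):                             # spec: (batch, 1, n_mels, frames)
        x = self.cnn(spec)                               # (batch, 64, n_mels // 4, frames)
        x = x.permute(0, 3, 1, 2).flatten(2)             # (batch, frames, features)
        _, h = self.rnn(x)                               # h: (1, batch, hidden)
        return self.head(h.squeeze(0))                   # (batch, 2) class logits

# Hypothetical usage: a batch of 4 mel spectrograms, 80 mel bins x 200 frames
logits = SpoofDetector()(torch.randn(4, 1, 80, 200))
```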


References

  1. A short history of speech recognition.
  2. Interspeech 2013.
  3. Fraudsters Used AI to Mimic CEO’s Voice in Unusual Cybercrime Case.
  4. Prosodic features.
  5. CyberVoice.
  6. Witcher 3 mod uses AI to create new voice lines without Geralt’s original voice actor.
  7. Google just gave a stunning demo of Assistant making an actual phone call.
  8. Introduction to Voice Presentation Attack Detection and Recent Advances.
  9. Wavenet.
  10. Here's how the 'Can you hear me?' phone scam works.
  11. Formant Frequencies.
  12. POCO: a Voice Spoofing and Liveness Detection Corpus based on Pop Noise.
  13. Void: A fast and light voice liveness detection system.
  14. Audio replay attack detection with deep learning frameworks.