Voice anti-spoofing: origin and methods

Historical Premise

Voice spoofing, or attacks on Automatic Speaker Verification (ASV), is a relatively new threat to biometric recognition systems. Voice recognition technology has been around for almost 70 years: the first voice recognition system, dubbed “Audrey”, was introduced by Bell Laboratories in 1952. However, voice spoofing attacks first became a major concern in 2013. The origin of voice anti-spoofing, its attack types and its preventive techniques can be traced back to the INTERSPEECH 2013 conference, where these issues were discussed for the first time.

Today, ASV attacks have become considerably more common due to the widespread use of smart gadgets available to millions of users.

Figure. Steps for a voice replay attack using a microphone and a recording application

Apple’s voice assistant Siri, introduced in 2011, led to the emergence of numerous similar services, such as Google Voice Search and Cortana, that could understand human speech and execute commands. As a result, ASV attacks can give malicious actors access to personal data, credit card details, commercially valuable information and even other devices and appliances controlled by voice assistants like Amazon’s Alexa. Companies and governments are also vulnerable to ASV spoofing. In 2019, a UK-based energy company lost $243,000 after impostors made a call simulating the voice of the CEO from the head office in Germany.

Terminology

When discussing voice anti-spoofing, its types, preventive techniques and origins, the following terminology is used:

ASV (Automatic speaker verification). A system that uses biometric information in a person’s voice (timbre, articulation, intonation, lexical behavior) to verify their identity.
SSD (Spoof speech detection). A set of measures used to identify and prevent a voice spoofing attack.
SS (Speech synthesis). A popular type of attack that employs a computer-simulated voice.
VC (Voice conversion). A spoofing attack that makes an impostor’s voice sound as close as possible to the voice of the targeted individual with the help of filters and other tools.
RA (Replay attack). An attack in which fraudsters use a pre-recorded sample of the victim’s voice.
Impersonation. A type of attack in which an attacker mimics the victim’s voice tonality, prosodic features, vocabulary, etc.
Physical access. A direct attack method in which an attacker interacts directly with the sensor of a biometric system (the microphone).
Logical access. An indirect attack method that targets vulnerabilities in the code or hardware of a biometric system.

Figure. Speech verification system in action

Types of attacks

ASV attacks are considered the easiest of all spoofing attacks to execute. Facial spoofing and fingerprint replication require effort, skill, and time to sculpt believable copies, whereas creating or obtaining a voice sample is much easier, faster, and cheaper.

Researchers have identified four main types of voice spoofing attacks:

Speech Synthesis
Voice Conversion
Replay
Impersonation

Each attack type has its own techniques and strategies, and can therefore be detected and countered with specific measures.

Speech synthesis

Speech synthesis is an ASV attack that has lately been gaining considerable attention from malicious actors. The core idea is simple: attackers use a text-to-speech generator to produce various pieces of audio data, such as passwords, commands, requests and names. Most text-to-speech synthesizers sound robotic and cannot be manipulated into mimicking a specific person’s voice. Synthesizers powered by neural networks are a different case. A good example is CyberVoice, a sophisticated synthesizer that can pick up various subtleties of a human voice, from its timbre and unique intonations to accents and speech impediments. CyberVoice became controversial in 2021 when it was used to simulate highly convincing voice acting for a game mod.

Another example of a highly believable synthesizer is Google Duplex. At the 2018 Google I/O event, it demonstrated the ability not only to imitate a person’s voice but also to carry on a conversation on its own.

Voice conversion

In essence, Voice Conversion (VC) takes an attacker’s natural voice and makes it sound similar to the voice of a targeted person. Earlier VC systems were based on parallel voice conversion, which compares acoustic parameters of the source and target voices and relies on techniques such as dynamic time warping and phone-level forced alignment. Currently, the primary method of executing VC is deep learning, which utilizes variational autoencoders or sequence-to-sequence neural networks. For example, in 2018 a neural network dubbed WaveNet was used to swap voices: while retaining the originally uttered text, it converted the original voice into a new one.
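
Dynamic time warping aligns two utterances spoken at different speeds so that their acoustic frames can be compared pairwise. The following is a minimal sketch in plain Python, not a voice conversion pipeline. The function name dtw_align and the one-number-per-frame features are assumptions made for illustration; real systems align multi-dimensional spectral features.

```python
from math import inf


def dtw_align(source, target):
    """Return (total_cost, path) aligning two 1-D feature sequences.

    path is a list of (source_index, target_index) pairs.
    """
    n, m = len(source), len(target)
    # cost[i][j] = cheapest cumulative cost of aligning source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_distance = abs(source[i - 1] - target[j - 1])
            cost[i][j] = frame_distance + min(
                cost[i - 1][j],      # source frame repeated
                cost[i][j - 1],      # target frame repeated
                cost[i - 1][j - 1],  # both sequences advance
            )
    # Walk back from the end to recover which frames were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Hypothetical per-frame features: the target holds the middle sound longer.
total_cost, alignment = dtw_align([1, 2, 3], [1, 2, 2, 3])
print(total_cost, alignment)
```

In a parallel VC setting, the aligned frame pairs would serve as training examples for a mapping from source features to target features.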

Replay attacks

Replay attacks largely shaped the origin of voice anti-spoofing, its types and its preventive techniques. They are considered the simplest, most effective and most widely used method of ASV attack. Attackers collect audio samples of a targeted person’s speech and later present them to the voice recognition system, that is, replay them. One of the best-known examples of replay attacks is the avalanche of spam calls that flooded the US and many other countries. During such a call, an impostor would trick the target into saying “Yes” or “No”, pronouncing their full name, or providing other short speech samples. Later, coupled with other data such as credit card details or bank information, these recordings would be used to steal money from the target's bank account, request a loan, or commit other forms of fraud. As a side threat, longer speech samples obtained through trickery, for example a target voicing an opinion on a certain topic, can be used to train a neural network to mimic their voice.

Impersonation

Impersonation is a method of ASV attack that often involves professional impersonators. Using their artistic skill, they emulate characteristics of a victim’s voice, such as formant frequencies, to utter the necessary phrases.

Figure. Example of formant frequencies: face muscle movement and the formant frequencies of different sounds are detected

Studies indicate that impersonation is not a highly effective means of attack: it leads to higher error rates, even when performed by professional and experienced impersonators. However, the overall threat posed by impersonation has yet to be fully assessed.

Mitigating Voice Spoofing

Countermeasures to ASV spoofing attacks, as proposed by security experts, can be divided into two groups: active liveness detection and passive liveness detection.

Active liveness detection

To detect liveness, active methods require a person to perform a certain action before the system can identify them successfully. In the case of voice anti-spoofing, the user is required to pronounce a passphrase that is randomly generated from a predefined list of words (a dictionary).
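
The random passphrase challenge can be sketched as follows. This is an illustrative outline only: the word list, the function names and the assumption that a speech recognizer has already produced a text transcript are all hypothetical.

```python
import secrets

# Hypothetical dictionary; a real system would use a much larger word list.
DICTIONARY = ["amber", "river", "seven", "copper", "window", "planet", "yellow", "forest"]


def generate_passphrase(words, length=4):
    """Pick `length` distinct words with a cryptographically secure RNG."""
    if length > len(words):
        raise ValueError("dictionary is smaller than the requested passphrase")
    pool = list(words)
    chosen = []
    for _ in range(length):
        # Removing the picked word guarantees that no word repeats.
        chosen.append(pool.pop(secrets.randbelow(len(pool))))
    return " ".join(chosen)


def matches_challenge(challenge, transcript):
    """Compare the challenge with a recognized transcript, ignoring case and spacing."""
    return challenge.lower().split() == transcript.lower().split()


challenge = generate_passphrase(DICTIONARY)
print("Please say:", challenge)
```

Because the phrase changes on every attempt, a recording captured during an earlier session is unlikely to contain the words requested in the next one.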

Active methods include:

Audio segmentation. This method employs time mapping between the audio stream and the pronounced phrase, with Hidden Markov Models (HMM) and the Viterbi algorithm as additional tools. As a result, audio segmentation helps to assess the “correctness” of the uttered phrase and determine whether it was really pronounced by the person in question.
Audiovisual recognition. Audio-Visual Automatic Speech Recognition (AVASR) tracks lip movement to check whether it matches the spoken words. It uses an anthropometric detector together with a face point distribution model (PDM).
Feature extraction. An audiovisual method that analyzes the lengths of vectors storing pairs of landmark points. The visual data is then compared with the target's speech to detect possible fakes.

Figure. Audio segmentation process: the audio and facial elements of a video can be separated and compared to detect liveness
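
To illustrate how a Hidden Markov Model and the Viterbi algorithm can map an audio stream onto segments, here is a toy sketch. The two states, the "low"/"high" energy observations and all probabilities are invented for illustration; a real segmenter would use phoneme-level states and acoustic feature vectors.

```python
from math import log


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s] = log-probability of the best path that ends in state s at time t
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: frames are either silence or speech.
STATES = ["silence", "speech"]
START_P = {"silence": 0.8, "speech": 0.2}
TRANS_P = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
EMIT_P = {
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}

frames = ["low", "low", "high", "high", "low"]
print(viterbi(frames, STATES, START_P, TRANS_P, EMIT_P))
```

The decoded state sequence marks where speech starts and ends, which is the kind of time mapping that can then be checked against the expected passphrase.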

Passive liveness detection

Passive liveness detection is performed at the system backend and does not require the user to take any specific action. It therefore provides a more seamless and convenient user experience, which makes it the preferred solution for recognition systems. Passive liveness detection focuses on analyzing the speech spectrogram itself and omits visual analysis of the speaker.

Methods include:

POCO method. POCO stands for Pop Noise Corpus. This method analyzes so-called “pop noise” to provide liveness detection. Put simply, pop noise is a group of sonic artifacts produced when the speaker pronounces words with plosive consonants, such as [p], [b], [k] and [g], or simply draws a breath. The artifact appears only when a person speaks directly into the microphone, so speech samples obtained by secretly recording a person’s voice do not contain it. Fake and stolen speech samples are therefore detected by their lack of pop noise.
Void. Void (Voice Liveness Detection) analyzes cumulative power patterns in acoustic spectrograms, which allows the system to determine whether a voice is replayed. For instance, Void can detect the natural distortion added by the loudspeakers that play back a replayed voice sample.

Figure. Cumulative distribution of spectral power density over frequencies (Void method)

Deep learning. This method combines Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN): the CNN is responsible for feature extraction, while the RNN models long-term dependencies. The combined model achieves a satisfactory spoofing detection rate.
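
The idea of a cumulative power distribution over frequency, which Void builds on, can be sketched as below. This is not the published Void implementation: the function names are hypothetical, the naive DFT is only suitable for very short signals, and a real detector would feed such features into a trained classifier.

```python
import cmath
import math


def power_spectrum(samples):
    """Power per frequency bin (0 .. N/2) computed with a naive DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def cumulative_power(spectrum):
    """Normalized cumulative distribution of power over the bins (ends near 1.0)."""
    total = sum(spectrum)
    running, cdf = 0.0, []
    for power in spectrum:
        running += power
        cdf.append(running / total)
    return cdf


def low_band_share(samples, cutoff_bin):
    """Fraction of total power located at or below `cutoff_bin`."""
    return cumulative_power(power_spectrum(samples))[cutoff_bin]


# Two synthetic 64-sample tones: one in a low bin, one in a high bin.
N = 64
low_tone = [math.sin(2 * math.pi * 2 * t / N) for t in range(N)]
high_tone = [math.sin(2 * math.pi * 20 * t / N) for t in range(N)]
print(low_band_share(low_tone, 4), low_band_share(high_tone, 4))
```

The shape of the cumulative curve differs between the two signals, and it is differences of this kind, between live and loudspeaker-replayed speech, that a classifier can learn to separate.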

Figure. Pop filters block plosive and breath sounds to prevent them from degrading the quality of audio recordings

FAQ

Is it possible to detect an artificial voice?

Voice liveness detection is capable of differentiating live and synthetic voices.

Artificial voices are known to be used in Presentation Attacks (PAs). In response, a range of methods capable of spotting a synthetic voice has been proposed. They focus on analyzing liveness signals, such as vocal tone, depth and inconsistencies, that can be observed only in a genuine human voice.

A number of solutions are dedicated to the issue: VoiceLive, Nuance, Void, VoiceGesture, and others. They employ a diverse repertoire of spoof-detecting techniques: power distribution analysis, time-difference-of-arrival (TDoA) dynamic extraction, phoneme segmentation, and even ear canal pressure detection.

What are the main datasets for voice liveness antispoofing?

The ASVspoof dataset series is a leading example among voice anti-spoofing databases.

The most notable voice spoofing datasets accompany the ASVspoof challenge, a pioneering competition dedicated to voice biometric security. The first edition of the competition offered a dataset based on the SAS corpus, which comprised 9 attack techniques based on speech synthesis and voice conversion.

The follow-up competition introduced the RedDots dataset, which consists of voice samples recorded with Android devices to stage a "stolen phone" scenario with accompanying replay attacks. ASVspoof 2019 contains samples emulating Physical and Logical Access attacks, and the latest dataset offers deepfake voice samples.

What are the main types of voice spoofing attacks?

Four voice spoofing attack types are studied and described.

Current research highlights the following ASV spoofing attack modalities: impersonation, replay attack, voice synthesis, and voice conversion. 1) Impersonation is usually performed by human culprits and tends to fail against Automatic Speaker Verification (ASV). 2) Replay attacks use samples of the target’s voice that were pre-recorded, stolen or obtained through social engineering. 3) Voice synthesis allows a person’s voice to be cloned realistically with deep learning. 4) Voice conversion also uses deep learning to transform the culprit’s voice, as they speak, into that of the target in real time. Voice anti-spoofing seeks to prevent all four.

What are voice deepfakes?

A voice deepfake is a synthesized recording of speech that never took place in reality.

Voice deepfakes can be produced with sophisticated text-to-speech synthesizers such as Adobe Voco and CyberVoice. The former is an unreleased app dubbed the “Photoshop for audio”, while the latter is a voice-cloning tool based on neural networks. Another method of producing a voice deepfake is parallel voice conversion, which allows a perpetrator to mask their own voice with that of a targeted person.

Modern voice deepfake tools are highly accurate at mimicking intonation and voice timbre.

How does liveness work with voice biometrics?

Voice biometrics, although quite an old concept, has developed enormously over the last decade.

Voice biometrics relies primarily on the fact that each person’s voice is highly distinctive because of how the human body produces sound. To analyze and verify the voice of a real user, voice biometrics software uses various artificial intelligence techniques to examine several key features of human speech, including timbre, articulation, intonation and lexical behavior, as well as spectral features. However, voice biometrics is vulnerable to spoofing attacks such as synthesized or pre-recorded speech. Liveness detection is the technology specifically designed to mitigate such attacks. The voice liveness (or anti-spoofing) module works in conjunction with voice biometrics to detect synthesized speech and pre-recorded or distorted audio, preventing third-party access to the secured data.
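
One simple way to combine the two modules is to require both checks to pass. The sketch below assumes that each module returns a score between 0 and 1; the function name, the score ranges and the thresholds are hypothetical and are not taken from any specific product.

```python
def verify(speaker_score, liveness_score, speaker_threshold=0.7, liveness_threshold=0.5):
    """Accept only if the sample looks live AND matches the enrolled speaker.

    Both scores are assumed to lie in [0, 1], where higher means more confident.
    """
    # Check liveness first: a spoofed sample may match the speaker very well.
    if liveness_score < liveness_threshold:
        return "reject: possible spoof"
    if speaker_score < speaker_threshold:
        return "reject: speaker mismatch"
    return "accept"


print(verify(speaker_score=0.92, liveness_score=0.81))  # genuine user
print(verify(speaker_score=0.95, liveness_score=0.10))  # e.g. a replayed recording
```

Running the liveness check before the speaker check reflects the point made above: a replayed or synthesized sample can score highly on speaker similarity, so similarity alone is not enough.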

What is text-dependent voice recognition?

Text-dependent voice recognition is based on verifying uttered passphrases.

Text-dependent voice recognition is a verification system that employs a specific text or a phrase (passphrase), uttered by the user to grant them access or to deny it to a potential trespasser.

To initiate text-dependent voice recognition, the system often requires its user to pronounce several words and short phrases to gather the required information on the unique features of the user’s voice. This data is then employed to ensure the exclusive access of a legitimate owner. Text-dependent liveness detection systems are becoming widely used in conjunction with other recognition systems to securely verify the end user.

What is text-independent voice recognition?

Text-independent voice recognition does not rely on specific passphrases collected from the user.

Text-independent voice recognition systems are frequently employed by call centers or support services of various organizations to verify the identity of a caller. As no particular phrase is used, this approach is more flexible and is becoming more common in applications such as smart speakers and electric cars. Remote verification using text-independent voice recognition can be used in conjunction with a voice anti-spoofing technique to protect the system against spoofing attacks. However, since a text-independent anti-spoofing system is more flexible, it is also more challenging to maintain.

Are there any voice anti-spoofing contests?

Voice anti-spoofing is the focus of several contests.

Voice anti-spoofing competitions started in 2015, when the ongoing ASVspoof challenge was hosted for the first time. Over the years, the competition took place three more times, and the fifth installment was scheduled for 2022. Each ASVspoof challenge was accompanied by a unique dataset and had its own goals.

The Speaker Antispoofing Competition and the Spoofing-Aware Speaker Verification (SASV) Challenge are two other notable competitions dedicated to ASV security. ID R&D’s challenge is another prominent event; it provided 10,323 samples of human voice recordings for training.
