Voice Antispoofing Contests
A series of voice antispoofing contests has been held to evaluate the most effective methods of preventing voice spoofing attacks.
Voice Antispoofing Challenges: Goals, Milestones & Results
Voice antispoofing contests have steadily been attracting attention since 2015 when the first ASVspoof challenge was introduced. The goal of the challenge was to identify the best solution that could differentiate real and synthesized speech, thus mitigating spoofing attacks.
Concerns around voice spoofing have grown with the spread of smart gadgets: Automatic Speaker Verification (ASV) is one of the most widely used biometric systems, which makes it an especially attractive target for malicious actors. What aggravates the situation even further is that a voice is easier to replicate than a face, fingerprints or other biometric modalities.
Today a plenitude of applications, devices and even vehicles rely on voice verification: mobile banking, telehealth, transportation, private and public security, Internet of Things, and others. Thus, voice spoofing can lead to an array of detrimental events: from money theft to espionage, smart system sabotage and potentially terrorism.
Among the existing voice antispoofing contests, ASVspoof is a pioneering challenge. Organized by Japan Science & Technology Agency, Eurecom and Academy of Finland, it has taken place every two years since 2015. However, so far it’s unclear whether the challenge will be hosted again in the future.
The ASVspoof challenge mostly focuses on presentation attacks produced with speech synthesis and voice conversion. The latter technique disguises the attacker’s voice, making it sound similar or nearly identical to the victim’s voice. This is achievable with techniques such as Dynamic Kernel Partial Least Squares Regression (DKPLS), Recurrent Temporal Restricted Boltzmann Machine (RTRBM), and other methods.
All in all, four ASVspoof challenges were hosted, with each producing unique results.
The debut challenge was based on the Spoofing and Anti-Spoofing (SAS) dataset v1.0, which incorporated 3,750 real and 12,625 modified speech samples. The genuine samples were acquired from 106 individuals without any effects or noise added to them. The spoof samples were altered with synthesis and voice conversion. Additionally, Development and Evaluation datasets with larger numbers of samples were offered.
Modification techniques ranged in sophistication. Among them were simple speech frame selection, adjustment of the first mel-cepstral coefficient (C1) to shift the source spectrum slope, a Hidden Markov Model (HMM) for speaker adaptation with a limited number of phrases, and the Festvox synthesis tool. In addition, the evaluation set featured unknown attacks.
The follow-up challenge was based on the RedDots dataset, which contained samples recorded with Android devices around the world. Its purpose was to simulate a ‘stolen voice’ situation, in which fraudsters have access to the original voice recordings of a victim and can replay them to fool the system.
The dataset had three partitions: Development, Training and Evaluation. They contained various utterances and ID passphrases recorded in a variety of acoustic environments and with a wide range of smartphones.
The challenge showed that ASV performance dipped significantly when a zero-effort impostor attack was replaced with a full-scale replay attack. Primary (S) systems employed a Gaussian mixture model (GMM) back-end classifier and constant Q cepstral coefficient (CQCC) features. Baseline (B) systems were baseline replay and non-replay detectors. System S01 achieved the best result, an equal error rate (EER) of 6.73%.
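The EER quoted for these systems is the operating point at which the false acceptance rate equals the false rejection rate. Below is a minimal sketch of how it can be estimated from detector scores, assuming that higher scores mean "more likely genuine". The function name `equal_error_rate` and the toy scores are illustrative assumptions, not part of the challenge tooling.

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """Return (EER, threshold). Higher scores mean 'more likely genuine'."""
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([genuine, spoof]))
    # False rejection rate: genuine trials scored below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance rate: spoof trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Toy scores (made up for illustration only).
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With these toy scores one genuine trial and one spoof trial fall on the wrong side of the threshold, so both error rates are one in four. Official evaluations typically interpolate between thresholds, so this sketch may differ slightly from published figures.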
The third installment of the contest was based on the Logical and Physical Access attack scenarios. In the first case, fraudsters use a Text-to-Speech (TTS) tool that can produce a realistic human voice. In the second case, attackers obtain a person’s voice recorded in any reverberating space and replay it to the verification system.
Consequently, the dataset was built around these two scenarios and comprised Training, Development and Evaluation parts. Attacks were also separated into known and unknown. The Logical Access (LA) dataset was produced with TTS systems that featured a conventional source-filter vocoder, a WaveNet-based vocoder, and other components.
The Physical Access (PA) dataset comprised speech samples recorded in rooms of varying sizes and reverberation characteristics, featuring 3 categories of volunteer speakers. The evaluation results showed that PA scenarios have a wider EER spread than LA attacks. In addition, PA detection did not depend on fusion strategies for better performance, whereas LA detection did.
The fourth challenge focused on LA and PA scenarios, as well as on Speech Deepfakes (DFs), which are highly realistic copies of real people’s voices. Therefore, the dataset was extended with counterfeit voice samples created with deep learning. In addition, the 2021 contest was more demanding, as its datasets contained more telephony artifacts, noises, reverberation types, etc.
In Logical Access, the best system was B03, with a minimum tandem detection cost function (min t-DCF) of 0.3445. In Physical Access, the best result belonged to the B01 system, with a min t-DCF of 0.9434. In Deepfake, the best performance was a 22.38% EER, demonstrated by B04.
Speaker Antispoofing Competition
The Speaker Antispoofing Competition was hosted in 2016. It primarily focused on replay attacks, in which a pre-recorded voice sample is replayed to the microphone of a verification system.
The competition’s dataset involved two attack variations that can be described as ‘unknown’:
- First case. Speech samples were recorded with a smartphone to simulate a scenario in which a person’s speech is secretly recorded. Then they were replayed to another smartphone.
- Second case. The original audio recordings were replayed to a smartphone to simulate a situation in which the digital voice recordings are simply stolen.
The detection methods included cepstral mean and variance normalization (CMVN), Deep Neural Networks, Bidirectional Long Short-Term Memory (BLSTM), Mel-frequency cepstral coefficients (MFCCs) for extracting speech signal features, etc.
The test results showed that all systems struggled to recognize new (or ‘unknown’) attack types.
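Of the methods listed above, CMVN is the simplest to illustrate: each cepstral coefficient is shifted to zero mean and scaled to unit variance across the frames of an utterance, which reduces channel and device effects. The following is a minimal sketch under the assumption that features arrive as a (frames, coefficients) matrix. The function name `cmvn` and the random stand-in for MFCCs are illustrative and do not reproduce any competitor’s pipeline.

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    features: array of shape (frames, coefficients), e.g. MFCCs.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0, keepdims=True)  # mean of each coefficient
    std = features.std(axis=0, keepdims=True)    # spread of each coefficient
    # eps guards against division by zero for constant coefficients.
    return (features - mean) / (std + eps)

# Random numbers standing in for 200 frames of 13 MFCCs (assumption).
rng = np.random.default_rng(0)
fake_mfcc = rng.normal(loc=5.0, scale=3.0, size=(200, 13))
normalized = cmvn(fake_mfcc)
print(normalized.mean(axis=0).round(6), normalized.std(axis=0).round(6))
```

After normalization every coefficient has a mean close to 0 and a standard deviation close to 1, whatever the offset and scale of the input.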
ID R&D Voice Antispoofing Challenge
ID R&D’s challenge was hosted in 2019 and offered a $7,635 prize pool. The goal was to find an algorithm that could successfully differentiate fake and real voice signals. The training data included 10,323 human speech recordings and 39,679 synthesized spoof samples. Nine teams earned the ‘gold’ with their undisclosed solutions.
Spoofing-Aware Speaker Verification (SASV) Challenge
The challenge began in February 2022. Its goal was to find the most effective countermeasure solutions (CMs) that can be integrated into ASV systems, and to encourage the development of ensemble and single systems that can detect and reject spoofed utterances.
As a result, the concept of a Spoofing-Aware Automatic Speaker Verification (SASV) system was proposed. The SASV-EER metric was selected for evaluation, and participants were provided with the Evaluation and Development protocols, as well as two Baseline solutions. Participants could train their systems on the following data:
- VoxCeleb 2.
- ASVspoof 2019 LA train partition.
- ASVspoof 2019 LA development partition.
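The SASV-EER metric used in this challenge can be understood as an ordinary EER in which only target trials count as positives, while both zero-effort non-target trials and spoofed trials count as negatives. The sketch below follows that reading. The helper names `eer` and `sasv_eer` and the toy scores are assumptions made for illustration, and the official scoring scripts may differ in detail.

```python
import numpy as np

def eer(positive_scores, negative_scores):
    """Equal error rate; higher scores mean 'accept'."""
    pos = np.asarray(positive_scores, dtype=float)
    neg = np.asarray(negative_scores, dtype=float)
    thresholds = np.sort(np.concatenate([pos, neg]))
    frr = np.array([(pos < t).mean() for t in thresholds])   # rejected positives
    far = np.array([(neg >= t).mean() for t in thresholds])  # accepted negatives
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

def sasv_eer(target, nontarget, spoof):
    """Pool non-target and spoof trials into a single negative class."""
    return eer(target, np.concatenate([nontarget, spoof]))

# Toy scores (made up): one spoofed trial scores higher than a target trial.
target = [0.9, 0.8, 0.7, 0.6]
nontarget = [0.1, 0.2]
spoof = [0.65, 0.3]
print("EER without spoofing:", eer(target, nontarget))
print("SASV-EER:", sasv_eer(target, nontarget, spoof))
```

In this toy case the verifier separates targets from zero-effort impostors perfectly, yet a single convincing spoof raises the pooled error rate, which is the behaviour a spoofing-aware metric is meant to expose.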
The best solution, by IDVoice, achieved a SASV-EER of 0.13%. It is based on two ResNet-34 architecture modifications with 48 and 100 hidden layers, trained with the Additive Margin Softmax (AM-Softmax) loss function.
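AM-Softmax differs from the ordinary softmax loss in that embeddings and class weights are length-normalized, so logits become cosine similarities, and a fixed margin is subtracted from the cosine of the true class before scaling. A minimal NumPy sketch of the loss follows. The function name `am_softmax_loss` and the scale and margin values (30 and 0.35) are illustrative assumptions, not the settings of the winning system.

```python
import numpy as np

def am_softmax_loss(embeddings, class_weights, labels, scale=30.0, margin=0.35):
    """Additive Margin Softmax loss averaged over a batch.

    embeddings:    (batch, dim) speaker embeddings
    class_weights: (classes, dim) one weight vector per class
    labels:        (batch,) integer class indices
    """
    emb = np.asarray(embeddings, dtype=float)
    w = np.asarray(class_weights, dtype=float)
    labels = np.asarray(labels)
    # L2-normalize both sides so the dot product is a cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w = w / np.linalg.norm(w, axis=1, keepdims=True)
    cosine = emb @ w.T                              # (batch, classes)
    logits = scale * cosine
    rows = np.arange(len(labels))
    # Subtract the margin from the true-class cosine only.
    logits[rows, labels] = scale * (cosine[rows, labels] - margin)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()

# Random stand-ins for embeddings and class weights (assumption).
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
weights = rng.normal(size=(3, 8))
labels = np.array([0, 1, 2, 0])
print("margin 0.00:", am_softmax_loss(emb, weights, labels, margin=0.0))
print("margin 0.35:", am_softmax_loss(emb, weights, labels, margin=0.35))
```

Because the margin lowers only the true-class logit, the loss with a positive margin is always higher than without it for the same inputs, which pushes the network toward embeddings that are separated from other classes by at least that margin.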