Audio Replay Attacks and Countermeasures Against Them

An audio replay attack uses a replayed stolen audio file to trick a voice recognition system

Definition and Overview

Replay attacks are a type of Presentation Attack (PA) in which stolen media files are replayed from a portable device to fool a biometric authentication system. The attack material is a recording of a legitimate user that retains the liveness of their voice, captured and then used to gain illicit access to a system, such as domestic appliances connected to the Internet of Things (IoT).

These attacks target a security system's sensors from the outside, such as the microphone of a home assistant device. Replay attacks should not be confused with injection attacks, which are carried out at the internal software or hardware level, where a piece of malicious data is “injected” by tampering with the signal stream.

Some antispoofing researchers refer to replay attacks as an “open sesame” scenario, in which a repeated passphrase grants access to certain valuables

Apart from biometric security systems, replay attacks are also seen in cybersecurity. True to their name, they are based on the same principle of capturing and retransmitting data previously used for authentication. This can include credit card details, a radio signal that unlocks a car, authenticated data packets, and so on.

Because they require little expense and effort, audio replay attacks pose a viable threat to Automatic Speaker Verification (ASV) systems, all the more so as most research is focused on deepfake speech synthesis and voice conversion. This challenge, in turn, has driven work on voice antispoofing for replay attacks: their origin, types, and preventive techniques.

Antispoofing statistics in voice PAs from 2015 to 2021. Note that the share of replay attacks tends to shrink over the last three years presented.

Replay Spoofing Attacks

In the field of voice antispoofing, replay attacks are classified as "low-effort" spoofing, as they do not require specific skills or equipment to orchestrate. Fraudsters simply need to obtain a target's voice recording in which a specific passphrase/voice access command is spoken.

Attackers can also use a splicing technique: a retrieved voice recording is cut into pieces, which are then rearranged to form the required phrase or even a single word.

Examples of single/multi-order voice replay attacks targeting IoT devices

The main aim of replay attack antispoofing is to differentiate between a voice signal obtained from a legitimate speaker and a voice recording replayed by an intermediary gadget.

Real speech can be modeled as glottal airflow, that is, the movement of air along the human vocal tract. Each vocal tract, in turn, has a distinctive physiological constitution, with individual differences in shape, size, and other anatomical properties.

A high-fidelity portable speaker can be used as a replay attack tool

Pre-recorded speech, in turn, is modeled as a bona fide speech signal convolved with the impulse response of the device used to perform the attack: a smartphone, a Bluetooth speaker, and so on. The propagating environment is also taken into account. The model is expressed by the following formula, where \displaystyle{ * } denotes convolution:

\displaystyle{ r[n]=s[n]*\eta[n] }

  • \displaystyle{ r[n] }: replayed speech signal.
  • \displaystyle{ s[n] }: genuine speech signal.
  • \displaystyle{ \eta[n] }: impulse response of the device together with the propagating environment.
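
The convolution in the formula can be illustrated with a minimal sketch. The sample values and the impulse response below are invented for illustration only; a real \eta[n] would be measured for a specific device and room.

```python
# Sketch of the replay model r[n] = s[n] * eta[n] (discrete convolution).
# All numeric values are made up for illustration.

def convolve(signal, impulse_response):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

# Hypothetical genuine speech samples s[n].
genuine = [1.0, 0.5, -0.25, 0.0]

# Hypothetical eta[n]: an attenuated direct path plus two weaker, delayed echoes.
device_and_room = [0.8, 0.0, 0.3, 0.1]

# Replayed signal r[n]: the genuine samples smeared by the device and room.
replayed = convolve(genuine, device_and_room)
print(replayed)
```

An impulse response of `[1.0]` leaves the signal unchanged, which corresponds to an ideal, distortion-free playback chain. Any other response alters the samples, and that alteration is what a detector looks for.
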
Spectrograms of natural and replayed speech in various acoustic environments


A defining trait of a replayed speech sample is that it carries convolutional and additive distortions which, though quite subtle, can be detected. To detect them, bona fide speech can be distinguished from a replayed recording using cepstral feature vectors.
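
One way to see why cepstral features suit this task is the following short derivation. It assumes only the convolutional model \displaystyle{ r[n]=s[n]*\eta[n] } and ignores additive noise. The symbols \displaystyle{ R }, \displaystyle{ S }, \displaystyle{ H } (the Fourier transforms of \displaystyle{ r }, \displaystyle{ s }, \displaystyle{ \eta }) and \displaystyle{ c } (the real cepstrum) are introduced here for illustration. Convolution in time becomes multiplication in frequency, and the logarithm turns that product into a sum:

\displaystyle{ R(\omega)=S(\omega)\,H(\omega) }

\displaystyle{ \log|R(\omega)|=\log|S(\omega)|+\log|H(\omega)| }

Since the real cepstrum is the inverse Fourier transform of the log-magnitude spectrum, and that transform is linear, the same additivity holds for cepstral coefficients:

\displaystyle{ c_{r}[n]=c_{s}[n]+c_{\eta}[n] }

Under these assumptions the device and environment contribute an additive offset in the cepstral domain, which is easier for a classifier to pick up than a convolution in the raw waveform.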

Replay Databases

A number of datasets are dedicated to voice replay attacks. The first one to feature replay attacks was the AVspoof corpus, which incorporates 10 attack scenarios including 4 types of replay attacks: Replay-phone, Replay-laptop, etc. The ASVspoof challenge series currently offers three datasets: 2015, 2017, and 2019. These datasets are divided into training, development, and evaluation subsets.

Audio signal from the ASVspoof 2017 dataset


Other notable examples include the Realistic Replay Attack Corpus for Voice Controlled Systems (ReMASC), which contains replayed voice commands; the Mobile Phone Detection with Modulated Replay Attack dataset (KitsuneV1), which explores modulated replay attacks that compensate for frequency distortion with an inverse filter; and the Voice Spoofing Detection Corpus (VSDC).

Methods Overview

Replay attack antispoofing includes two major methods: classifier tuning and feature extraction.

Approaches Using Classifiers

Classifiers for replay detection include:

  • Gaussian Mixture Model (GMM). A GMM is an easy-to-deploy method that scores a sample with a likelihood ratio computed over magnitude and phase features. For best results, it is suggested that two GMMs be trained: one for bona fide speech and one for replayed speech.
  • Support Vector Machines (SVM). An SVM-based system is trained on low-dimensional continuous vectors, known as i-vectors, extracted from both bona fide and attack samples. Based on this training, the SVM can differentiate attacks from genuine attempts.
  • Neural network approach. A two-part architecture consisting of a Convolutional Neural Network (CNN) and a Time-Delay Neural Network (TDNN) is proposed. The first network focuses on frame-level feature extraction, while the second is responsible for frame-level classification. Posterior probabilities of the bona fide and replayed classes are estimated for each input phrase to tell them apart.
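
The two-model likelihood-ratio idea from the GMM item can be sketched as follows. This is a deliberately reduced illustration, not the evaluated system: each "GMM" is shrunk to a single one-dimensional Gaussian, and the feature values, function names, and zero decision threshold are assumptions made for the example.

```python
import math

def fit_gaussian(samples):
    """Estimate mean and variance of a 1-D feature (a one-component 'GMM')."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, max(var, 1e-6)  # floor the variance to keep the density finite

def log_likelihood(x, model):
    """Log density of x under a 1-D Gaussian given as (mean, variance)."""
    mean, var = model
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def llr_score(frames, bona_fide_model, replay_model):
    """Average per-frame log-likelihood ratio; positive favors bona fide."""
    total = sum(
        log_likelihood(x, bona_fide_model) - log_likelihood(x, replay_model)
        for x in frames
    )
    return total / len(frames)

# Made-up training features for each class.
bona_fide_model = fit_gaussian([0.9, 1.1, 1.0, 1.2, 0.8])
replay_model = fit_gaussian([2.0, 2.2, 1.9, 2.1, 1.8])

# Made-up test utterances, each a short list of frame features.
genuine_trial = [1.0, 0.95, 1.05]
replay_trial = [2.0, 2.05, 1.95]

print(llr_score(genuine_trial, bona_fide_model, replay_model) > 0)
print(llr_score(replay_trial, bona_fide_model, replay_model) < 0)
```

A practical system would use multi-component mixtures over multi-dimensional magnitude and phase features, with a threshold tuned on a development set.
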

Interestingly, test results show that the GMM, while being the simplest of the three, also delivers the best performance.

Results demonstrated by three classifier approaches

Approaches Using Feature Extraction

A feature extraction method called the Mel-scale RP feature merges relative phase (RP) features with a Mel filter bank to detect replayed recordings. It benefits from both the perceptual scaling of the Mel filter bank in the range of 60-8000 Hz and the phase-based information representation of the RP features. The Mel filter bank serves to capture the phase information of the RP-based feature.
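
The perceptual scaling mentioned here can be illustrated with a short sketch of Mel-spaced filter centers over the 60-8000 Hz range. It assumes the widely used conversion mel = 2595 · log10(1 + f / 700); the source does not state which Mel variant or how many filters the authors use, so the formula choice and the filter count of 10 are assumptions.

```python
import math

def hz_to_mel(f_hz):
    """Common Mel-scale conversion (assumed variant, not confirmed by the source)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(f_min, f_max, num_filters):
    """Filter center frequencies in Hz, equally spaced on the Mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (num_filters + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(num_filters)]

# Hypothetical filter count; the 60-8000 Hz range comes from the text.
centers = mel_center_frequencies(60.0, 8000.0, 10)
for c in centers:
    print(round(c, 1))
```

The printed centers lie close together at low frequencies and spread out toward 8000 Hz, which is the perceptual scaling the feature relies on.
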

Process of Mel-scale RP feature extraction

The next steps include mapping the phase onto coordinates on a unit circle with the sine and cosine functions, phase data normalization, and logarithmic scaling. As a result, replay detection becomes more effective thanks to phase data extraction and the application of the Mel scale.
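
A brief sketch shows why mapping a phase onto the unit circle is useful: raw phase values wrap around at ±π, so two nearly identical phases can look far apart numerically. The function names and sample values below are illustrative assumptions, and the sketch covers only the sine/cosine mapping, not the full feature pipeline.

```python
import math

def phase_to_unit_circle(theta):
    """Represent a phase angle as a point (cos, sin) on the unit circle."""
    return math.cos(theta), math.sin(theta)

def chord_distance(theta_a, theta_b):
    """Euclidean distance between two phases after the unit-circle mapping."""
    ax, ay = phase_to_unit_circle(theta_a)
    bx, by = phase_to_unit_circle(theta_b)
    return math.hypot(ax - bx, ay - by)

# Two phases that sit on either side of the wrap-around point.
a = math.pi - 0.01
b = -math.pi + 0.01

raw_gap = abs(a - b)               # large, although the angles nearly coincide
circle_gap = chord_distance(a, b)  # small, reflecting their true closeness

print(raw_gap, circle_gap)
```

After the mapping, the two values differ by roughly 0.02 instead of more than 6, so downstream statistics no longer suffer from the discontinuity.
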

Additionally, the authors of the approach suggest using a gammatone filter instead of the Mel scale. The gammatone filter is a linear filter whose impulse response is the product of a sinusoidal tone and a gamma distribution envelope, and it delivers finer resolution at low frequencies.
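
The gammatone impulse response can be sketched from its standard textbook form, g(t) = t^(n-1) · exp(-2πbt) · cos(2πft). The filter order of 4, the Glasberg and Moore bandwidth approximation, the 1.019 scaling factor, and the 25 ms duration are common defaults assumed here; the source does not specify the authors' settings.

```python
import math

def erb_bandwidth(center_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)

def gammatone_impulse_response(center_hz, sample_rate, duration_s=0.025, order=4):
    """Peak-normalized gammatone impulse response: gamma envelope times a tone."""
    b = 1.019 * erb_bandwidth(center_hz)  # assumed bandwidth scaling
    n_samples = round(duration_s * sample_rate)
    response = []
    for n in range(n_samples):
        t = n / sample_rate
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        response.append(envelope * math.cos(2.0 * math.pi * center_hz * t))
    peak = max(abs(v) for v in response)
    return [v / peak for v in response]

# Hypothetical settings: a 1 kHz channel at a 16 kHz sampling rate.
ir = gammatone_impulse_response(1000.0, 16000)

# Narrower bandwidth at low center frequencies means finer frequency resolution.
print(erb_bandwidth(100.0), erb_bandwidth(4000.0))
```

The bandwidth grows with the center frequency, so low-frequency channels are narrow and resolve fine spectral detail.
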

Usage of the gammatone filter in the proposed method

Replay Attacks Detection

A variety of methods have been proposed for detecting replay attacks. They include active methods, which prompt the user to utter a specific passphrase, and passive methods. The latter include detection of pop noise, measurement of the distances between landmark points in the amplitude spectrum, the use of Mel-frequency cepstral coefficients (MFCCs), and so on.

Deep Learning in Replay Attack Antispoofing

Along with the CNN-TDNN tandem, other architectures have been proposed. A CNN combined with a Recurrent Neural Network (RNN) is capable of extracting features and modeling the long-term dependencies of a speech sequence. A Light CNN with Max-Feature-Map (MFM) activations and a batch padding approach has also been suggested.
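
The Max-Feature-Map activation can be sketched in a few lines: the channel axis is split into two halves, and the element-wise maximum of the paired channels is kept, which halves the number of channels. The nested-list layout and the values below are assumptions made for illustration; a real Light CNN applies this to convolutional feature maps inside a deep learning framework.

```python
def max_feature_map(channels):
    """Max-Feature-Map: element-wise max over the two halves of the channel axis."""
    if len(channels) % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    half = len(channels) // 2
    return [
        [max(a, b) for a, b in zip(first, second)]
        for first, second in zip(channels[:half], channels[half:])
    ]

# Four made-up channels with three activations each.
feature_maps = [
    [0.2, -1.0, 0.5],
    [0.0, 0.3, -0.7],
    [0.1, 0.4, 0.9],
    [-0.5, 0.6, -0.2],
]

reduced = max_feature_map(feature_maps)
print(reduced)  # [[0.2, 0.4, 0.9], [0.0, 0.6, -0.2]]
```

Unlike ReLU, MFM does not threshold at zero; it keeps the stronger of two competing responses, which acts as a built-in feature selection step.
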

Replay Attack Detection Using Frequency Characteristics

High-frequency feature analysis is also useful for replay attack detection. When speech is replayed rather than spoken directly, it passes through additional digital-to-analog and analog-to-digital conversions, which leave behind detectable artifacts such as anti-aliasing artifacts. These can be captured with features including Inverse Mel-Frequency Cepstral Coefficients (IMFCC) and Linear Prediction Cepstral Coefficients (LPCC).
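
The idea behind an inverted Mel filter bank can be sketched by mirroring Mel-spaced filter centers across the frequency range, so that the filters crowd the high band instead of the low band. This is a common description of IMFCC front ends and is assumed here; the Mel formula, the 0-8000 Hz range, and the filter count of 10 are illustrative choices rather than values taken from the cited work.

```python
import math

def hz_to_mel(f_hz):
    """Common Mel-scale conversion (assumed variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_centers(f_min, f_max, num_filters):
    """Filter centers in Hz, equally spaced on the Mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (num_filters + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(num_filters)]

def inverted_mel_centers(f_min, f_max, num_filters):
    """Mirror the Mel-spaced centers so that filters concentrate at high frequencies."""
    mirrored = [f_min + f_max - c for c in mel_centers(f_min, f_max, num_filters)]
    return sorted(mirrored)

# Hypothetical configuration.
mel = mel_centers(0.0, 8000.0, 10)
inverted = inverted_mel_centers(0.0, 8000.0, 10)

high_band_mel = sum(1 for c in mel if c > 4000.0)
high_band_inverted = sum(1 for c in inverted if c > 4000.0)
print(high_band_mel, high_band_inverted)
```

With more filters above 4000 Hz, the inverted bank devotes more resolution to the region where replay artifacts are expected to appear.
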

Replay Attack Detection Using Noise Characteristics

Analysis of noise characteristics suggests that replayed audio clips contain multi-layered noise coming from a) the genuine signal itself, b) the recording environment, c) the recording device, or d) the playback device. A multitask Deep Neural Network (DNN) can be trained under these four scenarios both to expose potential attacks through noise detection and to classify the noises.
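
A multitask objective of this kind is often written as a weighted sum of a spoofing loss and an auxiliary noise-classification loss. The sketch below shows only that loss combination. The two-head layout, the label sets, and the weight of 0.5 are assumptions for illustration and are not taken from the cited architecture.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])

def multitask_loss(spoof_logits, spoof_label, noise_logits, noise_label,
                   noise_weight=0.5):
    """Joint loss: spoof detection plus a weighted noise-classification term."""
    spoof_term = cross_entropy(spoof_logits, spoof_label)
    noise_term = cross_entropy(noise_logits, noise_label)
    return spoof_term + noise_weight * noise_term

# Hypothetical head outputs for one training example:
# spoof head has 2 classes (bona fide, replay), noise head has 4 noise sources.
loss = multitask_loss(
    spoof_logits=[0.2, 1.5], spoof_label=1,
    noise_logits=[0.1, 0.3, 2.0, -0.4], noise_label=2,
)
print(loss)
```

Both heads would share the lower layers of the network, so the gradients from the noise task shape the representation that the spoofing head relies on.
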

The proposed multitask DNN architecture
Fabfilter Pro-R is a specialized tool for modulating various acoustic environments with reverberation — it is widely used in music and movie production

FAQ

Replay attack definition

A replay attack is one of the most common spoofing techniques in biometrics.

Essentially, a replay attack involves intercepting specific data, which is then used to access a certain system. The data may vary: it can be encrypted files or a biometric template stored in a system such as Automatic Speaker Verification.

In the case of biometric spoofing, malicious actors replay media that belongs to the target. These can be audio recordings, videos featuring the face, iris, or other biometric traits, and so forth. Even though replay attacks are mostly primitive, some of them can still bypass biometric security.

If the media is digitally manipulated, it is classified as a different type of Presentation Attack. Check out this article to read about Presentation Attacks and methods that can be used to detect them.
