# Audio Replay Attacks and Countermeasures Against Them

## Definition and Overview

Replay attacks are a type of Presentation Attack (PA) in which stolen media files are replayed from a portable device to fool a biometric authentication system. The attack material is data that belongs to a legitimate user but has been captured and reused to gain illicit access to a system, such as domestic appliances connected to the Internet of Things (IoT).

Replay attacks target a security system's sensors externally: for example, the microphone of a home assistant device. Replay attacks should not be confused with injection attacks, which are carried out at the internal software or hardware level, where malicious data is 'injected' by tampering with the signal stream.

Apart from biometric security systems, replay attacks are also seen in cybersecurity. They are based on the same principle of capturing and retransmitting data previously used for authentication. This data can include credit card details, a radio signal that unlocks a car, authenticated data packets, and so on.

Because they require little expense and effort, audio replay attacks can represent a viable threat to Automatic Speaker Verification (ASV) systems, particularly as most research is focused on deepfake speech synthesis and conversion.

## Replay Spoofing Attacks

Replay attacks are classified as "low-effort" spoofing, as they do not require specific skills or equipment to orchestrate. Fraudsters simply need to obtain a target's voice recording, in which words featured in a specific passphrase/voice access command are spoken.

Attackers can also use a splicing technique to cut a retrieved voice recording into pieces and rearrange them to create a required phrase or even a single word, although this scenario can also be categorized as simple speech synthesis.

The main aim of replay attack anti-spoofing is to differentiate between a voice signal obtained from a real and legitimate speaker and a voice recording replayed by an intermediary gadget.

Real speech can be modelled as a glottal airflow that evokes a response from the human vocal tract. Each vocal tract, in turn, has a distinctive physiological constitution, which results in differences in shape, size, and other anatomical properties between speakers.

Pre-recorded speech, in turn, is modelled as a bona fide speech signal convolved with the impulse response of the gadget used to perform the attack (a smartphone, a Bluetooth speaker, and so on). The propagation environment is also taken into consideration. The model is expressed by the formula:

$\displaystyle{ r[n]=s[n]*\eta[n] }$

Interpretation:

• $\displaystyle{ r[n] }$: replayed speech signal.
• $\displaystyle{ s[n] }$: genuine speech signal.
• $\displaystyle{ \eta[n] }$: impulse response of the device combined with the propagation environment.
• $\displaystyle{ * }$: the convolution operator.
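The convolution model can be illustrated with a minimal Python sketch. The sample values and the three-tap impulse response below are hypothetical toy numbers chosen for readability, not measurements of any real device.

```python
def convolve(s, eta):
    """Full discrete convolution: r[n] = sum over k of s[k] * eta[n - k]."""
    r = [0.0] * (len(s) + len(eta) - 1)
    for i, s_i in enumerate(s):
        for j, eta_j in enumerate(eta):
            r[i + j] += s_i * eta_j
    return r


# Hypothetical toy data: a short "genuine" frame and an impulse response
# made of a direct path followed by two decaying echoes.
genuine = [1.0, 0.5, -0.25, 0.0]
impulse_response = [1.0, 0.5, 0.25]

replayed = convolve(genuine, impulse_response)
print(replayed)
```

The replayed signal is longer than the genuine one (here 6 samples instead of 4) because the impulse response smears each sample over time.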

A defining trait of a replayed speech sample is that convolutional and additive distortions, though quite subtle, can be detected. To do that, the bona fide speech component can be "subtracted", or separated, from the replayed recording in the cepstral domain using cepstral vectors.
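The reason a cepstral representation allows this kind of "subtraction" can be sketched as follows, assuming the convolution model holds exactly and additive noise is ignored. The symbols $R$, $S$, $H$ (Fourier transforms of $r$, $s$, $\eta$) and $c$ (real cepstrum) are introduced here for illustration only. Convolution in time becomes multiplication in frequency:

$\displaystyle{ R(\omega)=S(\omega)\,H(\omega) }$

Taking the logarithm of the magnitude turns the product into a sum:

$\displaystyle{ \log|R(\omega)|=\log|S(\omega)|+\log|H(\omega)| }$

The inverse transform is linear, so the cepstra add as well:

$\displaystyle{ c_r[n]=c_s[n]+c_\eta[n] }$

Under these assumptions the device and environment contribute an additive term in the cepstral domain, which is what makes them separable from the speech component.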

## Replay Databases

A number of datasets dedicated to voice replay attacks have been assembled. One of the first to feature replay attacks was the AVspoof corpus, which incorporates 10 attack scenarios including 4 types of replay attacks: Replay-phone, Replay-laptop, etc. It should not be confused with the ASVspoof challenge series, which has released datasets in 2015, 2017, and 2019, with the 2017 edition focused on replay attacks. These datasets are divided into training, development, and evaluation subsets.

Other notable examples include the Realistic Replay Attack Corpus for Voice Controlled Systems (ReMASC), which contains replayed voice commands; the Mobile Phone Detection with Modulated Replay Attack dataset (KitsuneV1), which explores modulated replay attacks that use an inverse filter to compensate for frequency distortion; and the Voice Spoofing Detection Corpus (VSDC).

## Methods Overview

Generally, replay attack anti-spoofing includes two major methods: classifier tuning and feature extraction.

### Approaches Using Classifiers

Classifiers for replay detection include:

• Gaussian Mixture Model (GMM). GMM is an easy-to-deploy method that uses likelihood-ratio scoring over magnitude and phase features. The suggested setup trains two GMMs: one for bona fide speech and one for replayed speech.
• Support Vector Machines (SVM). An SVM-based system is trained on low-dimensional continuous vectors known as i-vectors, which are extracted from both bona fide and attack samples. Based on this knowledge, the SVM can differentiate attacks from genuine attempts.
• Neural network approach. A two-stage architecture consisting of a Convolutional Neural Network (CNN) and a Time-Delay Neural Network (TDNN) is proposed. While the first network focuses on frame feature extraction, the second is responsible for frame-level classification. The posteriors of the input phrases, bona fide and replayed, are estimated to differentiate them.
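The two-GMM idea can be sketched as a log-likelihood ratio score. This is a minimal illustration rather than any published system: the function names are hypothetical, the mixture parameters are hand-set toy values (real models would be fitted to extracted features, typically with the EM algorithm), and diagonal covariances are assumed.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x_d, mu_d, var_d in zip(x, mu, var):
            log_p += -0.5 * (math.log(2 * math.pi * var_d) + (x_d - mu_d) ** 2 / var_d)
        component_logs.append(log_p)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def llr_score(frames, bona_fide_gmm, replay_gmm):
    """Average per-frame log-likelihood ratio; positive favours bona fide."""
    total = 0.0
    for x in frames:
        total += gmm_log_likelihood(x, *bona_fide_gmm) - gmm_log_likelihood(x, *replay_gmm)
    return total / len(frames)


# Hypothetical toy models: (weights, means, variances), two components each.
bona_fide_gmm = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
replay_gmm = ([0.5, 0.5], [[4.0, 4.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]])

frames_near_bona_fide = [[0.2, -0.1], [0.9, 1.1]]
frames_near_replay = [[4.5, 4.4], [5.1, 4.9]]

print(llr_score(frames_near_bona_fide, bona_fide_gmm, replay_gmm))
print(llr_score(frames_near_replay, bona_fide_gmm, replay_gmm))
```

A decision is then made by comparing the score with a threshold, which in practice would be tuned on a development subset.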

Interestingly, reported test results indicate that the GMM, while being the simplest of the three, also shows the best performance.

### Approaches Using Feature Extraction

One feature extraction method merges the relative phase (RP) feature with a Mel filter bank to detect replayed recordings. Named the Mel-scale RP feature, it benefits from both the perceptual scaling of the Mel filter bank in the range of 60-8000 Hz and the phase-based information representation of the RP feature. The Mel filter bank is used to capture the phase information of the RP-based feature.
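To make the perceptual scaling concrete, the sketch below places filter centre frequencies uniformly on the Mel scale between 60 and 8000 Hz. The conversion formula is the commonly used 2595·log10(1 + f/700) variant and the filter count of 10 is arbitrary; both are assumptions, as the text does not specify them.

```python
import math


def hz_to_mel(f_hz):
    """Common Hz-to-Mel conversion (assumed variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, f_min=60.0, f_max=8000.0):
    """Centre frequencies (Hz) of filters spaced uniformly on the Mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (num_filters + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(num_filters)]


centers = mel_center_frequencies(num_filters=10)
for c in centers:
    print(round(c, 1))
```

The centres are packed closely at low frequencies and spread out towards 8000 Hz, which is the perceptual scaling the feature relies on.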

The next steps are mapping the phase into coordinates on a unit circle with the sine and cosine functions, normalizing the phase data, and applying logarithmic scaling. Replay detection therefore becomes more effective thanks to the extraction of phase data and the application of the Mel scale.
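The unit-circle mapping step can be illustrated in isolation. The sketch below only shows why sine/cosine coordinates are convenient for phase values; it does not reproduce the full feature pipeline, and the helper names are hypothetical.

```python
import math


def phase_to_unit_circle(phases):
    """Map each phase angle (radians) to its (cos, sin) coordinates."""
    return [(math.cos(p), math.sin(p)) for p in phases]


def chord_distance(point_a, point_b):
    """Euclidean distance between two points in the plane."""
    return math.hypot(point_a[0] - point_b[0], point_a[1] - point_b[1])


# Two phases just either side of the wrap-around point at +/- pi.
near_plus_pi, near_minus_pi = 3.13, -3.13

raw_gap = abs(near_plus_pi - near_minus_pi)
a, b = phase_to_unit_circle([near_plus_pi, near_minus_pi])
circle_gap = chord_distance(a, b)

print(raw_gap, circle_gap)
```

As raw numbers the two phases look far apart, yet they describe almost the same angle. On the unit circle they land next to each other, so the wrap-around discontinuity no longer distorts distances between feature vectors.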

Additionally, the authors of the approach suggest using a gammatone filter bank instead of the Mel scale. The gammatone filter is a linear filter whose impulse response is the product of a sinusoidal tone and a gamma distribution envelope, and it delivers a more precise resolution at low frequencies.
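A gammatone impulse response can be sketched directly from that description. The fourth-order filter, the Glasberg and Moore ERB approximation for the bandwidth, the 1.019 scaling factor, and the 16 kHz sample rate are conventional choices assumed here, not parameters taken from the approach itself.

```python
import math


def erb_bandwidth(center_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)


def gammatone_impulse_response(center_hz, sample_rate=16000, duration_s=0.025, order=4):
    """Gamma-shaped envelope t^(order-1) * exp(-2*pi*b*t) times a cosine tone."""
    b = 1.019 * erb_bandwidth(center_hz)
    num_samples = round(duration_s * sample_rate)
    response = []
    for n in range(num_samples):
        t = n / sample_rate
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        response.append(envelope * math.cos(2.0 * math.pi * center_hz * t))
    return response


h = gammatone_impulse_response(center_hz=1000.0)
print(len(h), erb_bandwidth(100.0), erb_bandwidth(4000.0))
```

The bandwidth grows with centre frequency, so filters centred at low frequencies are narrower, which corresponds to the finer low-frequency resolution mentioned above.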

## Replay Attack Detection

A variety of methods have been proposed for detecting replay attacks. They fall into active methods, which prompt a user to utter a specific passphrase, and passive methods. The latter include detection of pop noise, measurement of the distances between landmark points in the amplitude spectrum, the use of Mel-frequency cepstral coefficients (MFCCs), and so on.

## Deep Learning in Replay Attack Anti-spoofing

Along with the CNN-TDNN tandem, other architectures have been proposed. A CNN combined with a Recurrent Neural Network (RNN) is capable of extracting features and modelling the long-term dependencies of a speech sequence. A Light CNN with Max-Feature-Map (MFM) activations and a batch padding approach has also been suggested.
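The Max-Feature-Map activation is simple enough to show in a few lines. The sketch below uses plain Python lists instead of a deep learning framework and hypothetical activation values. It illustrates the operation as it is usually described for Light CNN, where the channels are split into two halves and the element-wise maximum is kept.

```python
def max_feature_map(channels):
    """Split channels into two halves and keep the element-wise maximum."""
    if len(channels) % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    half = len(channels) // 2
    first, second = channels[:half], channels[half:]
    return [
        [max(a, b) for a, b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(first, second)
    ]


# Four hypothetical channels, each holding three activations.
activations = [
    [0.1, 0.9, -0.3],
    [0.5, 0.2, 0.0],
    [0.4, 0.1, 0.7],
    [-0.2, 0.6, 0.3],
]

reduced = max_feature_map(activations)
print(reduced)
```

The output has half as many channels as the input, so MFM acts as a competitive feature selector as well as a non-linearity.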

## Replay Attack Detection Using Frequency Characteristics

High-frequency feature analysis has been proposed for replay attack detection. It includes Inverse Mel-Frequency Cepstral Coefficients (IMFCC) and Linear Prediction Cepstral Coefficients (LPCC). These features help to detect artifacts left by the analogue-to-digital conversion when a speech sample is replayed rather than spoken. Anti-aliasing artifacts are one example.
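The idea behind IMFCC can be sketched by mirroring a Mel filter bank so that the dense filters sit at the high end of the band. The mirroring rule used below (reflecting the Mel-spaced centres around the band midpoint), the Mel formula, and the 0-8000 Hz range are assumptions made for illustration; implementations differ in detail.

```python
import math


def hz_to_mel(f_hz):
    """Common Hz-to-Mel conversion (assumed variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def inverted_mel_centers(num_filters, f_min=0.0, f_max=8000.0):
    """Mel-spaced centre frequencies reflected so resolution is finest at the top."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (num_filters + 1)
    mel_centers = [mel_to_hz(mel_min + step * (i + 1)) for i in range(num_filters)]
    # Reflect around the band midpoint, then return in ascending order.
    return sorted(f_min + f_max - c for c in mel_centers)


inverted = inverted_mel_centers(num_filters=10)
for c in inverted:
    print(round(c, 1))
```

The gaps between neighbouring centres shrink towards 8000 Hz, so more filters cover the high-frequency region that this analysis targets.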

## Replay Attack Detection Using Noise Characteristics

Analysis of noise characteristics suggests that replayed audio contains multi-layered noise coming from (a) the genuine signal itself, (b) the recording environment, (c) the recording device, and (d) the playback device. A multitask Deep Neural Network (DNN) can be trained on these four noise sources both to expose potential attacks through noise detection and to classify the noise.

## FAQ

### Replay attack definition

A replay attack is one of the most popular spoofing techniques in biometrics.

Essentially, a replay attack involves intercepting specific data, which is then used to access a certain system. The data may vary: it can be encrypted files or a biometric template stored in a system such as Automatic Speaker Verification.

In the case of biometric spoofing, malicious actors replay media that belongs to the target. It can be audio recordings, videos featuring the face, iris, or other biometric traits, and so forth. However, if the media is digitally manipulated, it should be classified as a different type of Presentation Attack rather than a replay attack. Even though replay attacks are mostly primitive, some of them can bypass biometric security.