Voice Conversion Attacks and Countermeasures

What Is Voice Conversion?

Voice conversion is a technique for altering one person's voice so that it closely mimics another person’s timbre, accent, articulation, and other vocal and speech parameters. Thanks to various deep learning solutions, conversion can be applied either to pre-recorded audio or in real time.

What Is a Voice Conversion Attack?

A typical example of a voice conversion technique

A voice conversion attack (VCA) is a type of spoofing attack in which a fraudster’s voice is converted into the voice of the victim. Such attacks often happen in real time and can target both Automatic Speaker Verification (ASV) systems and human listeners.

Voice Conversion Techniques

The adversarial constraint overview on Mel-spectrogram.

In most cases voice conversion consists of two principal elements:

Spectral mapping. This is a transformation technique that converts the source spectrum to its target version. In the case of a VCA, it’s used to mimic the target speaker’s timbre.

Prosody. Prosodic features include rhythm, intonation, articulation, voice loudness, and sometimes speech cohesiveness.
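
To make the prosody element more concrete, here is a minimal sketch of one classic way to convert intonation: shifting and scaling the source speaker's log-F0 (pitch) contour so that its mean and spread match the target's. The article does not name this method, so treat it as an illustrative assumption; the function names and the F0 values below are hypothetical.

```python
import math

def log_f0_stats(f0_values):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    logs = [math.log(f) for f in f0_values if f > 0]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return mean, math.sqrt(var)

def convert_f0(source_f0, src_stats, tgt_stats):
    """Map a source F0 contour onto the target speaker's log-F0 statistics.

    Unvoiced frames (F0 <= 0) are emitted as 0.0.
    """
    mu_s, sigma_s = src_stats
    mu_t, sigma_t = tgt_stats
    converted = []
    for f in source_f0:
        if f <= 0:
            converted.append(0.0)
        else:
            z = (math.log(f) - mu_s) / sigma_s  # normalise against the source
            converted.append(math.exp(z * sigma_t + mu_t))  # rescale to target
    return converted

# Hypothetical F0 contours in Hz (0 marks an unvoiced frame).
source_f0 = [110.0, 120.0, 0.0, 130.0, 115.0]
target_f0 = [210.0, 230.0, 250.0, 0.0, 220.0]

converted_f0 = convert_f0(
    source_f0, log_f0_stats(source_f0), log_f0_stats(target_f0)
)
```

After conversion, the log-F0 mean and standard deviation of `converted_f0` match those of the target contour, while the shape of the attacker's intonation is preserved. Real systems handle timbre separately through spectral mapping.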

However, the same effect can be achieved by different means. One of the proposed methods is the timbre-reserved adversarial attack. It employs an adversarial constraint, which simplifies the task of label collection. The constraint is applied throughout the training process so that the adversarial perturbation can be optimized. As a result, the synthesized voice retains the target's original timbre.

Overview of the spectral mapping technique

To be converted, the audio signal passes through a complex chain that includes encoders, a deep learning-based decoder, an upsampling layer, and a convolutional layer. At the very end of the chain is a vocoder, often based on a Generative Adversarial Network (GAN), which reconstructs the waveform in the target’s voice without affecting the words spoken by the attacker.

Countermeasures against Voice Conversion Spoofing Attacks

VCA detection overlaps with the detection of voice cloning attacks, as the two are somewhat similar. A number of methods have been proposed to counter them.

Review of Methods to Counter Voice Conversion Attacks

Comparison of the two speech phase modulations (top/middle) to a synthetic one (bottom)

A range of techniques is aimed at detecting VCAs. They include tracking acoustic artifacts left by the vocoder, analyzing the phase of real speech, monitoring dynamic speech variability (synthetic voices tend to show less of it), utterance-level estimation, and others.

A common ASV system with its weakness points outlined

Method Investigating Spectral Slope

As observed, voice conversion techniques shift the spectral slope of an attacker’s voice toward that of the target speaker. One proposed detection solution analyzes how successive feature frames shift toward the local maxima of the likelihood function. When that happens, the distance between successive feature vectors shrinks, while the feature density around the local maxima increases.
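
The "distance between successive feature vectors" cue can be illustrated with a small sketch. It assumes the features are already extracted as a list of equal-length numeric vectors, one per frame. The function names, the toy data, and the threshold are hypothetical and not taken from the method described above, which also models the likelihood function.

```python
import math

def mean_successive_distance(frames):
    """Average Euclidean distance between each feature vector and the next."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    total = 0.0
    for prev, curr in zip(frames, frames[1:]):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(prev, curr)))
    return total / (len(frames) - 1)

def looks_converted(frames, threshold):
    """Flag an utterance whose frames move less than `threshold` on average.

    The threshold is an assumption; in practice it would be tuned on
    labelled genuine and converted speech.
    """
    return mean_successive_distance(frames) < threshold

# Hypothetical 2-D feature tracks: the second one barely moves between frames.
natural = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [3.0, 4.0]]
smoothed = [[0.0, 0.0], [0.3, 0.4], [0.0, 0.0], [0.3, 0.4]]

natural_flag = looks_converted(natural, threshold=1.0)
smoothed_flag = looks_converted(smoothed, threshold=1.0)
```

With this toy data, only the `smoothed` track falls below the threshold, mirroring the observation that converted speech shows smaller frame-to-frame movement.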

Using Modified Local Ternary Patterns (MLTP)

MLTP is the basis for a descriptor capable of capturing both the natural dynamics of a human voice and the synthetic artifacts inherent in its synthesized copy. The descriptor is then used to train a BiLSTM classifier that delivers the final verdict.
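
For intuition, the sketch below computes plain one-dimensional local ternary patterns over a signal and collects them into a histogram. It is not the modified variant referred to above, whose details are not given here; the neighbourhood radius, the threshold, and the function names are assumptions chosen for illustration. The BiLSTM classifier is not shown.

```python
def ltp_codes(signal, radius=4, threshold=0.05):
    """Return (upper, lower) binary codes for each interior sample.

    `signal` must be a list of numbers. Each neighbour within `radius`
    samples is compared with the centre sample: it sets a bit in the
    "upper" code if it exceeds centre + threshold, a bit in the "lower"
    code if it is below centre - threshold, and no bit otherwise.
    """
    codes = []
    for i in range(radius, len(signal) - radius):
        centre = signal[i]
        neighbours = signal[i - radius:i] + signal[i + 1:i + radius + 1]
        upper = lower = 0
        for bit, value in enumerate(neighbours):
            if value > centre + threshold:
                upper |= 1 << bit
            elif value < centre - threshold:
                lower |= 1 << bit
        codes.append((upper, lower))
    return codes

def ltp_histogram(signal, radius=4, threshold=0.05):
    """Concatenate the histograms of upper and lower codes into one vector."""
    n_bins = 1 << (2 * radius)
    upper_hist = [0] * n_bins
    lower_hist = [0] * n_bins
    for upper, lower in ltp_codes(signal, radius, threshold):
        upper_hist[upper] += 1
        lower_hist[lower] += 1
    return upper_hist + lower_hist

# Toy signal with a single spike; radius=1 keeps the histogram tiny.
toy_signal = [0.0, 1.0, 0.0, 0.0, 0.0]
toy_histogram = ltp_histogram(toy_signal, radius=1, threshold=0.05)
```

A histogram like `toy_histogram`, computed per frame or per utterance, is the kind of fixed-length descriptor that could then be fed to a sequence classifier.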

Source Speaker Identification

The idea behind Source Speaker Identification (SSI) is that typical embedding networks cluster the target speaker’s utterances, separating them from phrases spoken by other people. SSI takes this a step further by learning the speaker embedding space so that utterances from a converted voice are separated out while remaining in the target speaker’s subspace.
