
Voice Conversion Attacks and Countermeasures

Voice conversion is a deep learning technique that transforms a speaker’s voice so that it closely mimics the voice of a target person.

What Is Voice Conversion?

Real-time voice conversion

Voice conversion is a technique of adjusting someone’s voice to closely mimic another person’s timbre, accent, articulation, and other vocal and speech parameters. Thanks to various deep learning solutions, it can be applied either to pre-recorded audio or in real time.

What Is a Voice Conversion Attack?

A typical example of a voice conversion technique

A voice conversion attack (VCA) is a type of spoofing attack in which a fraudster’s voice is converted into the voice of the victim. Such attacks often happen in real time and can target both Automatic Speaker Verification (ASV) systems and human listeners.

Voice Conversion Techniques

The adversarial constraint overview on Mel-spectrogram.

In most cases voice conversion consists of two principal elements: 

  1. Spectral mapping. This transformation converts the source spectrum into its target version. In a VCA, it’s used to mimic the target speaker’s timbre.
  2. Prosody. Prosodic features include rhythm, intonation, articulation, loudness, and sometimes speech cohesiveness.

Overview of a voice conversion model
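
To make the prosody element more concrete, here is a minimal sketch of one classical way to adapt pitch: a linear transformation of log-F0 that matches the source speaker's pitch statistics to the target's. The article does not say which prosody method a given attack uses, so this technique, the function names, and the F0 values are all illustrative assumptions.

```python
import numpy as np

def log_f0_stats(f0):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    f0 = np.asarray(f0, dtype=float)
    log_f0 = np.log(f0[f0 > 0])
    return log_f0.mean(), log_f0.std()

def convert_f0(f0_source, src_mean, src_std, tgt_mean, tgt_std):
    """Shift and scale a source F0 contour (Hz) into the target's pitch range.

    Unvoiced frames are marked with 0 and are left untouched.
    """
    f0_source = np.asarray(f0_source, dtype=float)
    converted = np.zeros_like(f0_source)
    voiced = f0_source > 0
    log_f0 = np.log(f0_source[voiced])
    converted[voiced] = np.exp((log_f0 - src_mean) / src_std * tgt_std + tgt_mean)
    return converted

# Hypothetical F0 contours in Hz; 0 marks an unvoiced frame.
source_f0 = np.array([0.0, 110.0, 120.0, 130.0, 0.0, 115.0])
target_f0 = np.array([200.0, 220.0, 0.0, 240.0, 210.0])

src_mean, src_std = log_f0_stats(source_f0)
tgt_mean, tgt_std = log_f0_stats(target_f0)
converted = convert_f0(source_f0, src_mean, src_std, tgt_mean, tgt_std)
```

After the transformation, the voiced frames of `converted` have the same log-F0 mean and spread as the target contour, while the shape of the source's intonation is preserved. Deep learning systems typically learn far richer prosody mappings than this.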

However, the effect can be achieved by different means. One proposed method is the timbre-reserved adversarial attack. It adds an adversarial constraint throughout the training process, so that the adversarial perturbation can be optimized; the constraint also simplifies the task of label collection. As a result, the synthesized voice retains the target’s original timbre.

Overview of the spectral mapping technique

To be converted, the audio signal passes through a complex chain that includes encoders, a deep learning-based decoder, an upsampling layer, and a convolutional layer. At the very end of the chain comes a vocoder, often based on a Generative Adversarial Network (GAN), which reconstructs the waveform in the target’s voice without altering the words spoken by the attacker.
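
The data flow through such a chain can be traced with a toy sketch. It uses random weights and only shows how array shapes change from stage to stage; it is not a trained model, the layer sizes and function names are assumptions, and the final vocoder stage is deliberately left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(mel, weight):
    """Content encoder stand-in: (frames, n_mels) -> (frames, hidden)."""
    return np.tanh(mel @ weight)

def decode(content, speaker_embedding, weight):
    """Decoder stand-in: append the target speaker embedding to every frame."""
    frames = content.shape[0]
    speaker = np.tile(speaker_embedding, (frames, 1))
    return np.tanh(np.concatenate([content, speaker], axis=1) @ weight)

def upsample(features, factor):
    """Upsampling layer stand-in: repeat each frame `factor` times."""
    return np.repeat(features, factor, axis=0)

def convolve_time(features, kernel):
    """Convolutional layer stand-in: 1D convolution along time, per channel."""
    channels = [np.convolve(features[:, c], kernel, mode="same")
                for c in range(features.shape[1])]
    return np.stack(channels, axis=1)

# Hypothetical sizes: 50 frames, 80 mel bins, 32 hidden units, 16-dim speaker embedding.
mel = rng.normal(size=(50, 80))
encoder_weight = rng.normal(size=(80, 32)) * 0.1
speaker_embedding = rng.normal(size=16)
decoder_weight = rng.normal(size=(32 + 16, 80)) * 0.1

content = encode(mel, encoder_weight)                        # (50, 32)
decoded = decode(content, speaker_embedding, decoder_weight)  # (50, 80)
upsampled = upsample(decoded, 2)                              # (100, 80)
converted_mel = convolve_time(upsampled, np.array([0.25, 0.5, 0.25]))  # (100, 80)
# A vocoder (not shown) would turn converted_mel into a waveform.
```

In a real system each stand-in would be a trained neural network, and the speaker embedding would be extracted from recordings of the target.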

Countermeasures against Voice Conversion Spoofing Attacks


VCA detection partly overlaps with the detection of voice cloning attacks, since the two are similar. Several groups of methods have been proposed to counter them:

  1. Review of Methods to Counter Voice Conversion Attacks

Comparison of the two speech phase modulations (top/middle) to a synthetic one (bottom)

A range of techniques is aimed at detecting VCAs. They include tracking acoustic artifacts left by the vocoder, analyzing the phase of real speech, monitoring dynamic speech variability (synthetic voices tend to show less of it), utterance-level estimation, and others.

A common ASV system with its weak points outlined

  2. Method Investigating Spectral Slope

Voice conversion has been observed to shift the spectral slope of the attacker’s voice towards that of the target speaker. One proposed detection solution analyzes how successive feature frames shift towards the local maxima of the likelihood function. When that happens, the distance between successive feature vectors shrinks, while the feature density around the local maxima increases.
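
The "reduced distance between successive feature vectors" cue can be illustrated with a short sketch. It computes only the average frame-to-frame distance and compares it with a threshold; the likelihood-function analysis described above is not reproduced. The function names, the threshold, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def mean_successive_distance(features):
    """Average Euclidean distance between consecutive feature frames.

    `features` has shape (frames, dims), e.g. one cepstral vector per frame.
    """
    features = np.asarray(features, dtype=float)
    diffs = np.diff(features, axis=0)
    return float(np.linalg.norm(diffs, axis=1).mean())

def looks_converted(features, threshold):
    """Flag an utterance whose frames move unusually little from one to the next."""
    return mean_successive_distance(features) < threshold

# Synthetic stand-ins: "natural" frames vary freely, while the "converted"
# frames are a time-smoothed copy that varies less between frames.
rng = np.random.default_rng(0)
natural = rng.normal(size=(200, 13))
kernel = np.ones(5) / 5
converted = np.apply_along_axis(
    lambda column: np.convolve(column, kernel, mode="same"), 0, natural
)

natural_score = mean_successive_distance(natural)
converted_score = mean_successive_distance(converted)
```

In practice the threshold would have to be calibrated on genuine and converted recordings, and the statistic would be one input among several rather than a detector on its own.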

  3. Using Modified Local Ternary Patterns (MLTP)

Overview of the MLTP-based framework

MLTP serves as the basis for a descriptor that captures both the natural dynamics of a human voice and the synthetic artefacts inherent in its synthesized copy. The descriptor is then used to train a BiLSTM classifier that delivers the final verdict.
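
For orientation, the sketch below implements a plain one-dimensional local ternary pattern histogram. This is the generic idea that MLTP builds on, not the modified variant itself, whose details the article does not give. The window size, threshold, and function names are assumptions, and the BiLSTM classifier is not shown.

```python
import numpy as np

def local_ternary_codes(frame, threshold):
    """Compare each neighbour with the centre sample of an odd-length frame.

    Returns +1 (clearly above), -1 (clearly below) or 0 (within the threshold).
    """
    frame = np.asarray(frame, dtype=float)
    center_index = len(frame) // 2
    center = frame[center_index]
    neighbours = np.delete(frame, center_index)
    codes = np.zeros(len(neighbours), dtype=int)
    codes[neighbours >= center + threshold] = 1
    codes[neighbours <= center - threshold] = -1
    return codes

def ltp_histogram(signal, window=9, threshold=0.05):
    """Normalized histogram of upper and lower ternary patterns over a signal."""
    signal = np.asarray(signal, dtype=float)
    n_neighbours = window - 1
    weights = 2 ** np.arange(n_neighbours)
    upper = np.zeros(2 ** n_neighbours)
    lower = np.zeros(2 ** n_neighbours)
    for start in range(len(signal) - window + 1):
        codes = local_ternary_codes(signal[start:start + window], threshold)
        # Split the ternary code into two binary patterns and count each one.
        upper[int(((codes == 1) * weights).sum())] += 1
        lower[int(((codes == -1) * weights).sum())] += 1
    histogram = np.concatenate([upper, lower])
    total = histogram.sum()
    return histogram / total if total > 0 else histogram

# Hypothetical input: a short sine wave standing in for an audio frame.
example_signal = np.sin(np.linspace(0.0, 20.0, 400))
descriptor = ltp_histogram(example_signal)
```

A descriptor of this kind yields one fixed-length vector per audio segment; a sequence of such vectors is the sort of input a recurrent classifier could consume.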

  4. Source Speaker Identification

The idea behind Source Speaker Identification (SSI) is that typical embedding networks cluster the target speaker’s utterances and separate them from phrases spoken by other people. SSI takes this a step further by learning the speaker embedding space, so that utterances from a fake converted voice are separated while remaining in the target speaker’s subspace.

Avatar Antispoofing

Editors at Antispoofing Wiki thoroughly review all featured materials before publishing to ensure accuracy and relevance.
