Voice cloning is a machine learning technology designed to seamlessly mimic a person's voice.

Voice cloning (VC) is an advanced technology powered by Artificial Intelligence (AI) that is capable of imitating a given person’s voice with uncanny precision. The neural network architectures employed in the process can mimic subtle nuances, including intonation, regional accent, idiosyncratic speech habits, and even emotion.
Voice cloning is regarded as one of the most elusive and potentially devastating types of biometric fraud. Scam calls orchestrated with the help of VC are realistic, cheap to produce, and can be highly lucrative. According to the Federal Trade Commission, imposter scams, the category into which VC-powered spoofing attacks fall, caused $2.6 billion in losses in 2022. To tackle the dangers posed by this type of fraud, voice antispoofing methods have been deployed in corporate, financial, and other sectors.

Voice Cloning Technology and Speech Processing Techniques
Typically, VC includes several stages:
- Sampling. Training material is gathered: samples of the target’s voice that are “fed” into the generative AI model.
- Analyzing. The voice characteristics are studied in the waveform and translated into mathematical values.
- Extracting features. The mathematical values that represent unique voice parameters are extracted and used as training material (see the sketch after this list).
- Synthesizing. At this stage, the cloned voice is generated by shaping a noise signal, typically white noise, since it contains every frequency of the audible spectrum (hence the name “broadband noise”).
- Finalization. The mimicked voice can either be driven by a simple Text-to-Speech synthesizer to recite a given text, or used in tandem with a voice conversion tool that disguises another person's voice as the cloned one in real time.
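For illustration, here is a minimal sketch of the analysis and feature-extraction stages using the librosa library. The file name is a placeholder, and the exact representation (a log-mel spectrogram here) varies from one cloning system to another.

```python
import librosa

# Load a (hypothetical) voice sample and turn it into log-mel-spectrogram
# features, one common form of the "mathematical values" a cloning model trains on.
waveform, sr = librosa.load("target_voice_sample.wav", sr=22050)

mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)   # log scale roughly matches human loudness perception

print(log_mel.shape)                 # (80 mel bands, number of frames)
```
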
Generative Adversarial Networks (GANs) are commonly used for this task, as they pair a generator with a discriminator: two components engaged in an adversarial game that pushes the model toward the most believable results.
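As a rough illustration of that adversarial game, the toy PyTorch sketch below trains a generator and a discriminator on short waveform segments. The layer sizes and the random “real” data are placeholder assumptions, not taken from any production cloning system.

```python
import torch
import torch.nn as nn

# Toy generator/discriminator pair operating on short waveform segments.
SEG = 1024                                   # samples per training segment
G = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, SEG), nn.Tanh())
D = nn.Sequential(nn.Linear(SEG, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(16, SEG)                  # stand-in for real voice segments

# Discriminator step: learn to score real segments high and generated ones low.
fake = G(torch.randn(16, 128)).detach()
loss_d = bce(D(real), torch.ones(16, 1)) + bce(D(fake), torch.zeros(16, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: learn to fool the discriminator.
fake = G(torch.randn(16, 128))
loss_g = bce(D(fake), torch.ones(16, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```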

Some Voice Cloning Solutions
There is an assortment of VC tools that employ various methods:
- Neural Voice Cloning with a Few Samples
Previously, it would have taken a large number of samples to clone a voice. However, a newer approach featuring speaker encoding and adaptation achieves the same level of cloning with just a few training samples and far fewer computational resources.
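A minimal sketch of the speaker-encoding idea follows: a small recurrent encoder maps each short sample to a fixed-length embedding, and the embeddings from just a few samples are averaged into one speaker vector that can condition a synthesizer. The layer sizes and random inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Maps a variable-length mel-spectrogram to a fixed-length speaker embedding."""
    def __init__(self, n_mels=80, hidden=256, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        _, h = self.rnn(mel)                     # h: (1, batch, hidden)
        return F.normalize(self.proj(h[-1]), dim=-1)

encoder = SpeakerEncoder()
# Three short samples of the target speaker (random stand-ins here).
samples = [torch.randn(1, 120, 80) for _ in range(3)]
speaker_vec = torch.stack([encoder(s) for s in samples]).mean(dim=0)
print(speaker_vec.shape)                         # (1, 128): vector that conditions the synthesizer
```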

- NAUTILUS
NAUTILUS is a one-shot model capable of cloning an “unseen voice,” meaning that it requires just a few seconds of untranscribed audio to imitate a voice. It relies on a multimodal architecture, which includes a backpropagation algorithm, latent linguistic embeddings, speech+text encoders and decoders, as well as a neural vocoder.

- Expressive Neural Voice Cloning
This solution offers controllable voice cloning that preserves the pitch and tempo of speech, as well as its emotion. It relies on latent style tokens that retrieve independent prosodic styles from the training samples, a pitch contour that tracks real-time changes in pitch and their impact on semantic meaning, speaker encoding, and so on.
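As a small illustration of the pitch-contour component, the sketch below extracts a per-frame fundamental-frequency (F0) track with librosa's pYIN implementation; the file name is a placeholder, and this is only the kind of prosodic signal such systems condition on, not the paper's exact pipeline.

```python
import librosa
import numpy as np

# Extract a pitch (F0) contour from a reference recording.
y, sr = librosa.load("reference_speech.wav", sr=22050)

f0, voiced_flag, _ = librosa.pyin(y,
                                  fmin=librosa.note_to_hz("C2"),
                                  fmax=librosa.note_to_hz("C6"),
                                  sr=sr)
f0 = np.where(voiced_flag, f0, 0.0)     # zero out unvoiced frames
print(f0[:10])                          # per-frame fundamental frequency in Hz
```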

- Emotional Speech Cloning Using GANs
This approach combines a neural voice-synthesis tool with EmoGAN, an adjusted version of CycleGAN that adds an emotional component to synthesized speech.

- Cloning Voice Using Very Limited Data in the Wild
The Hierotron-based model includes two components that perform independent functions: one is responsible for modeling the timbre, while the other adds the prosodic features, in other words, the patterns of tone and emphasis.

- Diffusion Probabilistic Modeling
Diffusion probabilistic modeling requires merely 15 seconds of training audio and about 3 minutes of processing time to produce a cloned voice. It relies on a novel architecture with two encoders operating on their respective input domains and a shared decoder trained to match the reverse diffusion process.
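The toy sketch below illustrates only the forward (noising) half of diffusion modeling: a mel-spectrogram is progressively corrupted with Gaussian noise under a closed-form schedule, and the denoising network that such systems train (omitted here) learns to reverse this process. The schedule and tensor sizes are illustrative assumptions.

```python
import torch

# Linear beta schedule for T diffusion steps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form (forward noising)."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

x0 = torch.randn(80, 400)          # stand-in mel-spectrogram (80 bands, 400 frames)
x_noisy = q_sample(x0, t=500)      # heavily corrupted version the reverse model must denoise
```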

- HiFi-GAN Model
The HiFi-GAN approach employs an x-vector to characterize the target speaker, competitive multiscale convolution to boost the vocoder’s performance, and one-dimensional depth-wise separable convolution to increase inference speed.
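A minimal PyTorch sketch of a one-dimensional depth-wise separable convolution follows; the channel counts are illustrative, not the ones used in the paper.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """A per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution:
    far fewer multiplications than a standard convolution, which is what speeds up
    vocoder inference."""
    def __init__(self, channels, out_channels, kernel_size=5):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, out_channels, kernel_size=1)

    def forward(self, x):                      # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

block = DepthwiseSeparableConv1d(channels=80, out_channels=256)
out = block(torch.randn(1, 80, 400))           # (1, 256, 400)
```
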
- Real-Time Voice Cloning System Using Machine Learning Algorithms
The proposed system combines a speaker encoder that creates voice embeddings, a synthesizer that generates Mel spectrograms, and a neural vocoder that converts those spectrograms into an audio waveform.
- Personalized Lightweight Text-to-Speech
In this case, a resource-efficient model is presented that relies on learnable structured pruning: it essentially produces a compact speaker-specific model while requiring fewer resources, which in turn makes it deployable on mobile devices.
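As a rough illustration, the sketch below applies PyTorch's built-in magnitude-based structured pruning to a single layer. This is only a stand-in for the learnable structured pruning described in the paper, and the layer size is arbitrary.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Structured pruning removes whole rows/channels rather than individual weights,
# which is what makes the slimmed-down model cheap enough for mobile devices.
layer = nn.Linear(512, 512)
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)  # drop 50% of output rows
prune.remove(layer, "weight")                                      # make the pruning permanent

print((layer.weight.abs().sum(dim=1) == 0).sum().item(), "of 512 rows zeroed out")
```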

- ZSE-VITS
ZSE-VITS is a zero-shot expressive version of VITS, a conditional variational autoencoder that also employs adversarial learning to provide end-to-end speech generation. It additionally utilizes TitaNet, a network capable of extracting speaker representations with the help of Squeeze-and-Excitation (SE) layers, 1D depth-wise separable convolutions, and so on.
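Below is a minimal sketch of a Squeeze-and-Excitation block for 1D (time-series) features, the kind of layer TitaNet uses to re-weight channels. The dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """Squeeze-and-Excitation: average-pool the time axis ("squeeze"), learn
    per-channel gates ("excite"), and rescale the feature map accordingly."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                     # x: (batch, channels, time)
        gates = self.fc(x.mean(dim=-1))       # (batch, channels)
        return x * gates.unsqueeze(-1)

se = SqueezeExcite1d(channels=256)
out = se(torch.randn(1, 256, 300))            # same shape, channels re-weighted
```
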
- A Real-Time Voice Cloning System with Multiple Algorithms for Speech Quality Improvement
AI voice solutions often struggle to “read” long text passages, polluting them with artifacts, mispronunciations, and the like. A novel approach proposed in 2023 couples a text determination module with a synthesizer module, while also adding a noise reduction algorithm and the SV2TTS framework to achieve superior audio quality.
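The paper's exact noise-reduction algorithm is not detailed here, but the sketch below shows a simple spectral-gating approach in the same spirit: estimate a per-frequency noise floor from a noise-only clip and mute STFT bins that fall below it. The file name and the assumption that the first half-second is noise-only are placeholders.

```python
import numpy as np
import librosa

def spectral_gate(y, noise_clip, threshold_db=10.0):
    """Attenuate STFT bins that fall below the estimated noise floor plus a margin."""
    stft = librosa.stft(y)
    noise_floor = np.abs(librosa.stft(noise_clip)).mean(axis=1, keepdims=True)
    mask = np.abs(stft) > noise_floor * (10 ** (threshold_db / 20))
    return librosa.istft(stft * mask)

y, sr = librosa.load("tts_output.wav", sr=22050)        # hypothetical synthesized clip
cleaned = spectral_gate(y, noise_clip=y[: sr // 2])     # assume the first 0.5 s is noise-only
```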

Potential Harm of Voice Cloning
It is hypothesized that VC will present a gamut of serious threats in the future, including disinformation campaigns, phone phishing, circumvention of voice recognition security systems, voicemail attacks, and so on.
Cloned Voice Attack Models
There are several attack modalities exploiting an AI-mimicked voice:
- Interaction with a biometric security system. A cloned voice can be used to break into a banking app, a secured terminal, a digital account, a remotely controlled device, and other protected systems.
- Harming other individuals. This refers to hate speech, discrimination, disinformation, and other similar instances of negative interaction.
- Defamation. A cloned voice can be used to produce a false message on behalf of a specific individual or a company, institution, or even governmental entity.
The training samples for sculpting a synthetic voice can be obtained from a plethora of sources: voice notes from a messaging app, secretly recorded phone calls or live conversations, publicly available videos, mobile provider databases, etc.
Countermeasures against Cloned Voice Attack
Detecting VC attacks can be done either manually or by using sophisticated algorithms.
Manual Recognition of Cloned Voice Attacks
A cloned voice can be detected “manually” — without using special solutions — by a few red flags:
- No breath is heard during the speech.
- Strange artifacts, distortion, and noises.
- Shallow background knowledge related to a targeted person.
- A highly targeted topic related to sensitive personal details or finances.
- Misplaced syllable stress and a lack of the acceleration and deceleration natural to human speech patterns (also known as “unnatural prosody”).
However, if fraudsters go through meticulous preparations, these hallmarks may become less and less noticeable as their technique is honed.
Cloned-Voice Detection Using Higher-Order Spectral Analysis
It is reported that synthesized speech can be spotted by detecting specific artifacts that are hidden from human perception. These artifacts emerge in every cloned voice, no matter which cloning algorithm was used.
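Higher-order spectral analysis typically means examining quantities such as the bispectrum or bicoherence, which expose phase couplings that an ordinary power spectrum (and the human ear) misses. The sketch below is a rough bicoherence estimator using one common normalization; it illustrates the general idea rather than the specific detector from the work described above.

```python
import numpy as np

def bicoherence(x, nfft=256, hop=128):
    """Rough bicoherence estimate via the direct FFT method: average normalized
    triple products X(f1)X(f2)X*(f1+f2) over overlapping frames."""
    window = np.hanning(nfft)
    spectra = [np.fft.rfft(x[i:i + nfft] * window)
               for i in range(0, len(x) - nfft + 1, hop)]
    n_bins = (nfft // 2 + 1) // 2                 # keep f1 + f2 inside the spectrum
    num = np.zeros((n_bins, n_bins), dtype=complex)
    den = np.zeros((n_bins, n_bins))
    idx = np.arange(n_bins)[:, None] + np.arange(n_bins)[None, :]
    for X in spectra:
        triple = X[:n_bins, None] * X[None, :n_bins] * np.conj(X[idx])
        num += triple
        den += np.abs(triple)
    return np.abs(num) / (den + 1e-12)            # values in [0, 1]

b = bicoherence(np.random.randn(22050))           # stand-in for one second of audio
print(b.shape, b.mean())
```
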
- BiLSTM model
Bidirectional Long Short-Term Memory (BiLSTM) is an antispoofing approach that combines MFCC, GTCC, spectral flux, and spectral centroid features into a single feature vector representing the audio signal. This helps capture natural speech variation and tell it apart from synthetic artifacts, thus unmasking a spoofed voice.
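A minimal sketch of that feature-vector-plus-BiLSTM idea follows. The GTCC component is omitted because librosa has no built-in implementation, spectral flux is approximated by librosa's onset-strength envelope, the file name is a placeholder, and the network is untrained.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

# Build a per-frame feature vector (MFCCs + spectral centroid + flux-like measure)
# and feed it to a small bidirectional LSTM classifier.
y, sr = librosa.load("suspect_call.wav", sr=16000)           # hypothetical recording

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)            # (13, frames)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)      # (1, frames)
flux = librosa.onset.onset_strength(y=y, sr=sr)[None, :]      # (1, frames)

frames = min(mfcc.shape[1], centroid.shape[1], flux.shape[1])
features = np.vstack([mfcc[:, :frames], centroid[:, :frames], flux[:, :frames]]).T

class BiLSTMDetector(nn.Module):
    def __init__(self, n_features=15, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)                  # one logit: cloned vs. genuine

    def forward(self, x):                                     # x: (batch, frames, n_features)
        out, _ = self.lstm(x)
        return self.head(out.mean(dim=1))

detector = BiLSTMDetector()
logit = detector(torch.tensor(features, dtype=torch.float32).unsqueeze(0))
print(torch.sigmoid(logit))                                   # probability the clip is spoofed
```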

- Using Acoustic Ternary Patterns with Gammatone Cepstral Coefficients Features
Acoustic Ternary Patterns (ATP) and Gammatone Cepstral Coefficients (GTCC) help detect unnatural speech cadence, absence of pauses, wrong syllable stress, mispronounced words, and other signs of the poor prosody inherent to synthesized voices.

- Single and Multi-Speaker Cloned Voice Detection
This approach suggests using three types of features: low-dimensional perceptual features, generic spectral features, as well as end-to-end learned features. These allow a real human voice to be detected by its higher amplitude variability in speech.
- CloneAI
CloneAI is a Convolutional Neural Network (CNN) supported by Linear Frequency Cepstral Coefficients (LFCC) and Mel spectrogram features, which together allow it to analyze speech characteristics and filter out voices with subtle unnatural traits.
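In the same spirit, here is a tiny, untrained CNN classifier over a log-mel spectrogram using torchaudio. The parallel LFCC branch is omitted for brevity, and all sizes and the random input are illustrative assumptions rather than CloneAI's actual architecture.

```python
import torch
import torch.nn as nn
import torchaudio

# Log-mel spectrogram front end followed by a small 2D CNN producing one logit.
melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, 16000 * 3)                 # stand-in for 3 s of audio
spec = to_db(melspec(waveform)).unsqueeze(0)         # (1, 1, 64, frames)

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 1))                                 # one logit: genuine vs. cloned

print(torch.sigmoid(cnn(spec)))                       # probability the clip is synthetic
```
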
Cloned Voice Detection Experiments and Contests
A number of tests have been orchestrated to see how security models can prevent voice spoofing.
Evaluating Cloned Voice Detection Techniques
In one 2020 experiment, a duo made up of WaveNet and the Tacotron Text-to-Speech synthesizer was tested against CNN-powered Automatic Speaker Verification and some other tools. The results showed that antispoofing solutions based on CNNs, the Gaussian Mixture Model-Universal Background Model (GMM-UBM), and others can actually lose reliability when facing realistic voice cloning.

Multi-Speaker Multi-Style Voice Cloning Challenge
The M2VoC challenge was hosted in 2021. It featured two competition tracks divided into subtracks in which Speech Quality, Speaker Similarity, and Style Similarity were assessed, with the Mean Opinion Score (MOS) used to rate solutions. The results showed that LPCNet, a WaveRNN-based neural vocoder, is quite effective, as it allowed the top teams to achieve the highest scores.
Comparison of the Neural Network Model and Human Performance
During the experiment, it was discovered that human listeners have a hard time distinguishing cloned voices from real ones, scoring around 50% accuracy, which is essentially random chance. Meanwhile, antispoofing solutions such as Baidu’s Deep Speaker, built from convolutional layers and ResNet-type blocks, showed a 98.8% resistance to VC attacks.
A number of contests have been devised to test the accuracy of voice liveness antispoofing tools. To learn more about these competitions and their detection methods, read on here.