Voice Liveness Detection Systems — Challenges and Solutions

From Antispoofing Wiki

Voice liveness detection helps prevent spoofing attacks that rely on voice replay, voice cloning, and similar techniques.

Voice Liveness Detection: Definition, Goals & Methods

Voice liveness detection is a group of security measures that detect and prevent voice spoofing attacks. These attacks can be characterized as a type of voice phishing. Although scams involving remote verbal communication have existed since the earliest days of telephony, voice spoofing is a relatively novel threat. Concerns were first raised with the advent of smart gadgets, because voice is one of the most widely used modalities for user authentication, along with fingerprints, face, and iris. The emergence of deepfakes in 2017 reinforced these concerns by showing that a human voice, just like a face, can be replicated realistically with the help of artificial intelligence (AI).

Numerous AI-powered tools can now produce lifelike voice replicas: SteosVoice (formerly CyberVoice), RealTalk, Resemble, and many others. They can imitate intonation, accent, vocal timbre and pitch, articulation, and other unique traits native to the human speech apparatus. The broad use of voice authentication in mobile banking, e-commerce, virtual assistants, online healthcare (for example, Amazon Transcribe Medical), and other spheres makes voice spoofing attacks especially dangerous. One such attack, carried out in 2020, caused a UAE-based company to lose $35 million to fraudsters.

Voice Liveness in Commercial Products

A number of online applications, such as mobile banking and booking services, employ voice recognition and verification to improve customer experience and reduce reliance on PIN codes, passwords, or challenge questions. Several solutions have been proposed to secure these applications.

IDLive Voice

IDLive Voice, developed by ID R&D, is reported to identify a synthesized voice "within milliseconds". According to the vendor, IDLive Voice includes algorithms that detect specific spectral artifacts inaudible to the human ear. Such artifacts are usually left by speech conversion and Text-to-Speech generators. A related research approach computes short-term spectral features using an inverted frequency warping scale and an overlapped block transformation of filter bank log energies; together, these features help expose discrepancies between live and synthetic voices.

Nuance

VocalPassword by Nuance is a tool that detects pre-recorded and/or edited voice (cheapfake) used in a spoofing attack. Its mechanism is based on the intra-session voice variation principle:

  • A speaker makes an utterance to get verified.
  • The system captures an audio sample from the utterance.
  • Then the speaker is prompted to repeat a random part of their phrase.
  • The system compares the two samples and derives a liveness detection score.
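
The prompt-and-compare flow above can be sketched roughly as follows. This is a minimal illustration and not Nuance's implementation: the feature vectors, the cosine-similarity scoring, and both thresholds (min_sim, max_sim) are assumptions made for the example. The assumed idea is that two live utterances of the same phrase are similar but never identical, whereas a replayed recording is a near-exact copy.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def intra_session_liveness(first, repeat, min_sim=0.80, max_sim=0.999):
    """Return (similarity, is_live) for two samples from one session.

    Hypothetical rule: accept only if the repeat resembles the first
    sample (>= min_sim) without being a near-exact copy (< max_sim).
    """
    sim = cosine_similarity(first, repeat)
    return sim, (min_sim <= sim < max_sim)

# Made-up feature vectors standing in for real acoustic features.
enrolled_phrase = [1.0, 2.0, 3.0, 4.0]
live_repeat = [1.3, 1.7, 3.4, 3.5]      # similar, with natural variation
replayed_repeat = [1.0, 2.0, 3.0, 4.0]  # bit-for-bit copy

print(intra_session_liveness(enrolled_phrase, live_repeat))
print(intra_session_liveness(enrolled_phrase, replayed_repeat))
```

A production system would work on real acoustic features and calibrated thresholds; the band-style decision here only illustrates why the repeated utterance is requested.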

VocalPassword uses its Utterance Validation engine for automatic speech recognition. It is also enhanced with 'fraudster detection', a feature that registers known fraudsters and alerts the system to their presence.

VoiceVault

VoiceVault has developed a system to counter suspicious and fraudulent calls, aimed especially at seniors and individuals who "live far from financial institutions". (According to the FBI, elderly citizens lost almost $1 billion to scammers in 2020 alone.) The solution includes a biometric engine that is capable of: a) fending off replay attacks that use prerecorded voices, and b) detecting the liveness of the caller's voice.

General Solutions

Among generally applicable solutions, the Void system stands out.

Void

Developed by South Korean researchers, Void is a detector designed to prevent replay attacks. It analyzes how signal power is distributed across the audible frequency range. Identifying peak patterns in the low- and high-power frequency ranges helps further evaluate the incoming signal.

Void’s algorithm includes three steps:

  1. Signal transformation.
  2. Feature extraction.
  3. Attack detection performed in real time.

Void computes four feature types in succession: low-frequency power features (FVLFP), signal power linearity degree features (FVLDF), higher-power frequency features (FVHPF), and linear prediction cepstral coefficient (LPCC) features (FVLPC).
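
To make the idea of power-distribution features more concrete, the sketch below computes a cumulative power curve over frequency, a low-band power fraction, and a linearity score (the Pearson correlation between the curve and a straight line). It is a simplified illustration under stated assumptions, not the feature definitions from the Void paper: the function names, the naive DFT, and the toy signals are all hypothetical.

```python
import cmath
import math

def power_spectrum(signal):
    """Naive O(n^2) DFT power spectrum for bins 0..n/2 (short demos only)."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def cumulative_power(spectrum):
    """Normalized cumulative power: the last value is 1.0."""
    total = sum(spectrum)
    running, curve = 0.0, []
    for p in spectrum:
        running += p
        curve.append(running / total)
    return curve

def low_band_fraction(spectrum, cutoff_bin):
    """Share of total power that lies below cutoff_bin."""
    return sum(spectrum[:cutoff_bin]) / sum(spectrum)

def linearity(curve):
    """Pearson correlation between the curve and its bin index."""
    n = len(curve)
    mean_x = (n - 1) / 2
    mean_y = sum(curve) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(curve))
    var_x = sum((x - mean_x) ** 2 for x in range(n))
    var_y = sum((y - mean_y) ** 2 for y in curve)
    return cov / math.sqrt(var_x * var_y)

n = 64
tone = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]  # power in bin 2 only
impulse = [1.0] + [0.0] * (n - 1)                              # flat spectrum

tone_spec, flat_spec = power_spectrum(tone), power_spectrum(impulse)
print(low_band_fraction(tone_spec, 8), linearity(cumulative_power(tone_spec)))
print(low_band_fraction(flat_spec, 8), linearity(cumulative_power(flat_spec)))
```

The two toy signals are not speech; they only show the two extreme curve shapes (front-loaded versus near-linear) that a linearity-style feature separates. How each shape maps to live or replayed audio follows the published feature definitions, which are not reproduced here.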


Void was tested on two datasets, one private and one public, and showed an Equal Error Rate (EER) of 0.3% and 11.6%, respectively. When combined with a Gaussian Mixture Model that uses Mel-frequency cepstral coefficients (MFCC), Void achieves an EER of 8.7% on the public dataset while being more resource-efficient than a rival system based on deep learning.
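
For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoofed audio accepted) equals the false rejection rate (live audio rejected). The sketch below shows one simple way to estimate it from two lists of scores. The function name and the scores are made up for illustration and are unrelated to the Void results quoted above.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    A sample is accepted as live when its score is >= threshold.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    best = None
    for t in sorted(set(genuine_scores) | set(spoof_scores)):
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

# Hypothetical liveness scores (higher means "more likely live").
genuine = [0.9, 0.8, 0.7, 0.6, 0.4]
spoof = [0.5, 0.3, 0.2, 0.1, 0.05]

eer, threshold = equal_error_rate(genuine, spoof)
print(f"EER = {eer:.1%} at threshold {threshold}")
```

With so few scores the estimate is coarse; evaluations on real datasets typically interpolate between thresholds.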

Voice Liveness Systems in Smartphones

For voice liveness detection in smartphones, the following solutions are proposed:

VoiceGesture

VoiceGesture is based on analyzing the articulatory gestures a person makes when producing phoneme sounds. To detect liveness more accurately, the smartphone speaker is triggered during authorization and emits a 20 kHz tone that is inaudible or barely audible to the human ear. The built-in microphone records both the passphrase and the reflections of the tone, after which Doppler shifts around 20 kHz are extracted. The resulting audio data is then compared to the user's audio profile created during the enrollment stage. If the similarity score exceeds the threshold set by the profile, liveness is confirmed.
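
The size of the Doppler shifts involved can be estimated with the standard approximation for a tone reflected off a slowly moving surface, shift = 2 * v * f0 / c. The sketch below applies it to the 20 kHz tone mentioned above. The speed of sound and the articulator speeds are illustrative assumptions, and the function names are hypothetical rather than part of VoiceGesture.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at about 20 C (assumed)

def doppler_shift_hz(velocity_mps, tone_hz=20_000.0, c=SPEED_OF_SOUND):
    """Approximate shift of a tone reflected off a moving surface.

    Positive velocity means the surface moves toward the phone. The
    factor 2 accounts for the round trip (speaker, articulator,
    microphone); the approximation holds when velocity is far below c.
    """
    return 2.0 * velocity_mps * tone_hz / c

def liveness_confirmed(similarity, threshold):
    """Decision rule from the text: the score must exceed the threshold."""
    return similarity > threshold

for v in (0.05, 0.10, 0.25):  # illustrative speeds, not measured values
    print(f"{v:.2f} m/s -> {doppler_shift_hz(v):.1f} Hz")
```

Under these assumptions the shifts amount to only tens of hertz around the 20 kHz carrier, which suggests why a narrow band around the tone is the region of interest.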

VoiceLive

VoiceLive is a similar method that analyzes the phonemes uttered by a user. Its core idea is to extract the time-difference-of-arrival (TDoA) dynamics of the input phrase and compare them to those of a passphrase already stored in the device's memory. If the similarity score is higher than the preset threshold, authorization succeeds.
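
As a rough illustration of TDoA, the sketch below computes the arrival-time difference of a sound at two microphones from simple geometry and then scores an observed TDoA sequence against an enrolled one. The two-microphone layout, the coordinates, the similarity formula, and its scale parameter are assumptions chosen for the example; they are not taken from the VoiceLive paper.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s (assumed)

def tdoa_seconds(source, mic_a, mic_b, c=SPEED_OF_SOUND):
    """Arrival time at mic_a minus arrival time at mic_b, in seconds."""
    return (math.dist(source, mic_a) - math.dist(source, mic_b)) / c

def tdoa_similarity(observed, enrolled, scale=1e-5):
    """Map the mean absolute TDoA difference to a score in (0, 1].

    Hypothetical scoring: identical sequences give 1.0, and the score
    falls as the sequences drift apart (scale is in seconds).
    """
    mean_diff = sum(abs(o - e) for o, e in zip(observed, enrolled)) / len(enrolled)
    return 1.0 / (1.0 + mean_diff / scale)

# Hypothetical phone with microphones 14 cm apart along the y axis.
mic_bottom = (0.0, 0.0, 0.0)
mic_top = (0.0, -0.14, 0.0)

# Two made-up sound origins a few centimetres from the bottom microphone.
origins = [(0.0, 0.05, 0.0), (0.01, 0.05, 0.02)]
enrolled = [tdoa_seconds(p, mic_bottom, mic_top) for p in origins]
observed = [t + 2e-6 for t in enrolled]  # small measurement deviation

print(enrolled)
print(tdoa_similarity(observed, enrolled))
```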

Voice Liveness in Voice Assistants

A few novel approaches have been developed for voice assistants (VAs) as well.

LiveEar

LiveEar is based on the notion that human and synthetic speech have contrasting TDoA values, because different phonemes are produced at different positions in the vocal apparatus. Analysis and validation rely on a model pre-trained on a large quantity of TDoAs. The system employs techniques such as phoneme segmentation and the generalized cross-correlation with phase transform (GCC-PHAT) method for TDoA calculation.
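
GCC-PHAT estimates the delay between two channels by whitening their cross-spectrum so that only phase information remains, then locating the peak of the inverse transform. The sketch below is a minimal pure-Python version using a naive DFT. It assumes two microphone channels and a synthetic test signal, and it is an illustration of the general method rather than LiveEar's implementation.

```python
import cmath
import math
import random

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (short demos only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def gcc_phat_delay(sig, ref, max_lag):
    """Delay of sig relative to ref in samples (positive: sig lags ref)."""
    n = len(sig) + len(ref)  # zero-pad to avoid circular wrap-around
    spec_sig = dft(list(sig) + [0.0] * (n - len(sig)))
    spec_ref = dft(list(ref) + [0.0] * (n - len(ref)))
    cross = [a * b.conjugate() for a, b in zip(spec_sig, spec_ref)]
    # PHAT weighting: keep the phase, discard the magnitude.
    whitened = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    correlation = [v.real for v in dft(whitened, inverse=True)]
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: correlation[lag % n])

# Synthetic noise burst that reaches the second channel 3 samples later.
rng = random.Random(0)
burst = [rng.uniform(-1.0, 1.0) for _ in range(24)]
mic_ref = burst + [0.0] * 8
mic_sig = [0.0] * 3 + burst + [0.0] * 5

delay = gcc_phat_delay(mic_sig, mic_ref, max_lag=8)
print(delay)
```

Dividing the delay in samples by the sampling rate gives the TDoA in seconds. A practical implementation would use an FFT library instead of the quadratic-time transform shown here.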

Detection Using Ear Canal Pressure

A study from Montclair State University suggests that the liveness of a speaker can be detected by measuring air pressure in the ear canal at a sampling rate of 500 Hz while simultaneously recording the user's voice. The method includes steps such as signal segmentation based on a Hidden Markov Model, resampling with a finite impulse response (FIR) filter, and normalization.
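
The resampling and normalization steps can be pictured with the sketch below, which low-pass filters a signal with a windowed-sinc FIR filter, keeps every fifth sample (for example 500 Hz down to 100 Hz), and applies z-score normalization. The target rate, the tap count, the Hamming window, and the choice of z-score normalization are assumptions made for illustration; the study's actual parameters are not given here.

```python
import math

def lowpass_fir(num_taps, cutoff):
    """Windowed-sinc low-pass taps; cutoff is a fraction of the sampling rate."""
    mid = (num_taps - 1) / 2
    taps = []
    for i in range(num_taps):
        x = i - mid
        ideal = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))  # Hamming
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # unit gain at DC

def fir_decimate(samples, factor, num_taps=31):
    """Low-pass filter, then keep every factor-th output sample."""
    taps = lowpass_fir(num_taps, 0.5 / factor)  # cutoff at the new Nyquist rate
    out = []
    for start in range(0, len(samples) - num_taps + 1, factor):
        block = samples[start:start + num_taps]
        out.append(sum(t * s for t, s in zip(taps, block)))
    return out

def zscore(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

# One second of a made-up 3 Hz pressure oscillation sampled at 500 Hz.
pressure = [math.sin(2 * math.pi * 3 * t / 500) for t in range(500)]
resampled = fir_decimate(pressure, factor=5)
normalized = zscore(resampled)
print(len(pressure), len(resampled))
```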

Voice Liveness Systems in Voice User Interfaces for the Internet of Things (IoT)

Because many modern IoT devices (and vehicles) can be operated through a voice user interface (VUI), they also require voice liveness detection as an essential component.

VibLive

VibLive exploits the laryngeal formant, bone-conducted vibrations, and other essential characteristics of a genuine human voice. These features are absent from audio produced by the high-end equipment used in replay attacks. In such attacks, the sound is produced by a diaphragm attached to a voice coil, which generates mechanically stable audio frequencies that a human voice cannot produce.

WiVo

WiVo is a two-factor authentication solution that analyzes mouth motion and links it to the phrases uttered by the speaker. It relies on channel state information (CSI) and requires no additional gadgets: WiVo uses a pair of antennas on the IoT device to gather CSI signals while also recording the speaker's voice. The process includes four steps: 1) CSI collection and voice recording, 2) data preprocessing, 3) feature selection, and 4) detection. As with other methods, liveness is confirmed if the score exceeds the predefined threshold.
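
The four steps can be pictured with the toy pipeline below, which smooths a CSI-derived mouth-motion series and a voice energy envelope, correlates them, and applies a threshold. Everything in it is an assumption made for illustration: the moving-average preprocessing, the Pearson correlation used as the score, the 0.7 threshold, and the sample values are hypothetical and do not reproduce WiVo's actual features or classifier.

```python
import math

def moving_average(values, width=3):
    """Step 2 (preprocessing): smooth a series with a sliding mean."""
    return [sum(values[i:i + width]) / width for i in range(len(values) - width + 1)]

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either series is (almost) constant."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a < 1e-12 or var_b < 1e-12:
        return 0.0
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cov / math.sqrt(var_a * var_b)

def consistency_check(csi_motion, voice_envelope, threshold=0.7):
    """Steps 3 and 4: score mouth-motion/voice agreement, then decide."""
    score = pearson(moving_average(csi_motion), moving_average(voice_envelope))
    return score, score > threshold

# Step 1 stand-ins: made-up per-frame CSI motion and voice energy values.
voice = [0.2, 0.6, 1.0, 0.5, 0.3, 0.9]
csi_live = [0.1, 0.5, 0.9, 0.4, 0.2, 0.8]  # mouth moves with the speech
csi_replay = [0.3] * 6                      # loudspeaker: no mouth motion

print(consistency_check(csi_live, voice))
print(consistency_check(csi_replay, voice))
```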
