Human Performance in Face Liveness Detection

Human performance in face liveness detection is a source of controversy and requires proper scientific evaluation.

People and machines in face recognition

Machine-operated systems are quicker and more productive than humans at repetitive tasks. Facial recognition is one such task: vast numbers of faces must be recognized almost instantly. For instance, American Airlines introduced face scanners at its terminals to speed up boarding and serve up to 500,000 passengers daily.



Human performance in face liveness detection has been a topic of controversy, prompting the question of whether people should take part in face liveness recognition at all. As a result, a series of tests and experiments was conducted in which humans and machines competed directly. In one such test, the standard performance metrics APCER and BPCER were used to assess human performance, and the results showed that humans display a considerably higher error rate. At the same time, it has been proposed that the human observer cannot be excluded from decision-making entirely. A number of companies and bodies today rely on face verification, among them Airbnb, MasterCard, Australia Bank, and Interpol. Accurate face liveness detection, and the question of human performance in it, is therefore a pressing issue.
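
For reference, APCER (Attack Presentation Classification Error Rate) and BPCER (Bona Fide Presentation Classification Error Rate) are the standard presentation attack detection metrics from ISO/IEC 30107-3. In their simplest form:

```latex
% Share of presentation attacks wrongly accepted as genuine
\mathrm{APCER} = \frac{\text{attack presentations classified as bona fide}}{\text{total attack presentations}}

% Share of genuine (bona fide) presentations wrongly rejected as attacks
\mathrm{BPCER} = \frac{\text{bona fide presentations classified as attacks}}{\text{total bona fide presentations}}
```

A lower APCER means fewer accepted attacks; a lower BPCER means fewer genuine users rejected. The two trade off against each other, which is why both are reported in the experiments below.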

Humans vs. Computers in Liveness Detection

Face manipulations are becoming more elaborate and pose a greater threat. Moreover, producing deepfakes is now possible even for people with no technical background, so the number of face fraud attempts is expected to grow considerably in the foreseeable future. Experts highlight strong and weak aspects of both human and computer face liveness detection. According to one study, fake images and synthesized media often have "limited naturalness", which makes them easier to spot with the naked eye. At the same time, some face manipulations can be highly sophisticated, even exceeding the apparent quality of genuine images; in such cases human inspection fails, and only machine algorithms can expose the fake media.

Human Detection

Researchers point to four main aspects that make it easier for a human to detect a spoof image:

  • Segmentation. In most cases, a human will reject a photo or video sample as fake if it lacks quality: heavy pixelation, poor lighting, and visual noise all raise doubt.
  • Face blending. Morphing multiple faces into one is a type of face manipulation often used to spoof border control. While difficult to detect, such media, if poorly produced, retains artifacts and blunders visible to the human eye; it may also fail to meet standard ID photo requirements.
  • Fake faces. Synthesized faces made with a generative adversarial network (GAN) can look extremely realistic. At the same time, the computing power and knowledge required may be unavailable to low-level con artists, so poorly produced media often exposes itself.
  • Poor synchronization. Temporal inconsistencies in a fake video quickly attract attention: poor lip-syncing, audio-video mismatches, and strange facial expressions can all reveal a fake video.



Interestingly, one way to spot a fake video is to pay attention to the presented person's behavior. Observing so-called accompanying behavior (facial expressions, gestures, and body language) is a way in which humans inherently identify one another. If there is a degree of unnatural behavior in the footage, the human brain can discern it, especially when situational context is provided.


Computer Detection

Computer detection generally outperforms humans at fake face detection, though computers can also show high error rates: one study notes that "several computer algorithms performed with high error rates" when trying to detect morphed images.

Computer detection relies on a number of techniques:

  • Data modalities detection. This method analyzes several modalities, including audio spectrograms, spatio-temporal video features, and audio-video inconsistencies. It can detect artifacts left by processing tools, phoneme-viseme mismatches, and similar clues.
  • Temporal sequential analysis. This method is a powerful tool for detecting deepfakes. It employs OpenCL and the temporal-based C-LSTM model, which together extract frames from the source video to check which of them were used for face swapping.
  • Heart rate estimation. This method involves a detector capable of estimating the heart rate of the person presented in a video. It registers facial illumination and blood oxygenation to analyze slight differences in skin coloration invisible to the naked eye, allowing the system to spot a "synthetic person" (a rough sketch of the idea follows this list).
  • Illumination-based analysis. This method relies on a simple technique: flashing randomly generated colors and verifying the light reflected off the subject's face. Using linear regression models and a system of cameras and screens, the method then verifies the timing of the response; among other things, it helps to identify the person's face shape (see the second sketch after this list).
  • MAD. Morphing Attack Detection (MAD) is based on three types of descriptors: gradient-based, texture-based, and descriptors learned by a deep neural network. They extract features from the image in question to spot clues left by the morphing procedure.
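
As an illustration of the heart rate approach, the sketch below estimates a pulse frequency from the mean green-channel intensity of a cropped face clip, a common remote photoplethysmography (rPPG) baseline. It is a minimal sketch under assumed inputs (the `frames` array and `fps` rate are illustrative), not the detector from the cited study.

```python
import numpy as np

def estimate_heart_rate_bpm(frames: np.ndarray, fps: float) -> float:
    """Estimate a pulse frequency from a clip of cropped face images.

    `frames` is assumed to be a (num_frames, height, width, 3) RGB array;
    both the name and the shape are illustrative assumptions.
    """
    # Blood-volume changes modulate skin color most visibly in the green
    # channel, so average it over the face region, frame by frame.
    signal = frames[:, :, :, 1].astype(np.float64).mean(axis=(1, 2))
    signal -= signal.mean()  # drop the DC component

    # Find the dominant frequency inside a plausible heart-rate band.
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 4.0)  # 42 to 240 beats per minute
    if not band.any():
        raise ValueError("clip too short to resolve heart-rate frequencies")
    return float(freqs[band][np.argmax(spectrum[band])] * 60.0)

# A live face yields a clear spectral peak inside the band; a synthesized
# or printed face tends to produce only a flat, noisy spectrum there.
```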
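
Similarly, the illumination-based check can be sketched as a challenge-response test: flash random colors and verify, via linear regression, that the observed face color follows the challenge. The names, threshold, and synthetic data below are illustrative assumptions, not the cited protocol's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
challenge = rng.uniform(size=(60, 3))  # per-frame RGB flash intensities

def reflects_challenge(challenge: np.ndarray, face_means: np.ndarray,
                       min_r2: float = 0.8) -> bool:
    """Fit the observed face color as a linear function of the emitted light."""
    X = np.hstack([challenge, np.ones((len(challenge), 1))])
    coef, *_ = np.linalg.lstsq(X, face_means, rcond=None)
    residual = face_means - X @ coef
    r2 = 1.0 - (residual**2).sum() / ((face_means - face_means.mean(0))**2).sum()
    return bool(r2 >= min_r2)

# A live face reflects the flashes, so its mean color responds linearly:
skin = np.array([[0.6, 0.1, 0.1], [0.1, 0.5, 0.1], [0.1, 0.1, 0.4]])
live = challenge @ skin + rng.normal(scale=0.02, size=(60, 3))
print(reflects_challenge(challenge, live))  # True

# A replayed video cannot follow the randomly chosen challenge in real time:
print(reflects_challenge(challenge, rng.uniform(size=(60, 3))))  # False
```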

Since anti-spoofing is a new and constantly evolving area, many techniques and technologies beyond those discussed above are under active research.

Human vs. Machine Performance Experiments

ID R&D’s Experiment

An experiment conducted at ID R&D Inc. was aimed at determining how efficiently people can handle face liveness detection. During the test, standard Presentation Attack Instruments (PAIs) were used: 2D and 3D masks, printed photos, printed cutouts, and displayed images. The experiment revealed that people generally perform much worse than machines at face liveness detection.

APCER-based performance of the human examinees

ATTACK TYPE   2D MASK   3D MASK   PRINTED CUTOUT   PRINTED PHOTO   DISPLAY
APCER         2.04%     2.35%     2.04%            30.34%          15.04%

The BPCER-based results, in other words the rate of rejected genuine images, also showed a considerable error rate of 18.28%.



The second part of the experiment focused on collective judgment. A group of 17 people was challenged to identify fake images, with the final decision made by majority vote. In this case, the error rate was lower; however, computer-based detection still outperformed human detection.
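
The improvement from voting is what one would expect statistically. As a toy illustration (assuming, unrealistically, that observers err independently), the probability that a majority of 17 observers is wrong can be computed as follows:

```python
from math import comb

# Probability that a majority (9 or more) of n independent observers err,
# given an individual error rate p. Independence is an idealizing assumption;
# real observers make correlated mistakes, so actual gains are smaller.
def majority_error(p: float, n: int = 17) -> float:
    k_min = n // 2 + 1  # wrong votes needed for a wrong majority decision
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# An individual error rate of 15% would collapse to a tiny fraction of a percent:
print(f"{majority_error(0.15):.4%}")  # ~0.03%
```

That the observed group APCER for display attacks fell only from 15.04% to 8.7% suggests the participants' errors were strongly correlated: they tended to be fooled by the same samples.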

APCER-based performance of 17 participants is shown below:

ATTACK TYPE   2D MASK   3D MASK   PRINTED CUTOUT   PRINTED PHOTO   DISPLAY
APCER         0.15%     0.05%     0.00%            27.47%          8.7%

Experiment with Images Synthesized by Generative Adversarial Networks

A similar experiment was conducted collaboratively by several institutions, including the University of California, Berkeley. A dataset of human faces was used, only half of which were real; the other half was synthesized with the StyleGAN2 generative adversarial network.



The test results were unsatisfactory: the examinees' average performance was 50%, which is close to chance. In the second stage of the test, participants received training and improved to 60%; however, the accuracy of human detection remained low.

Experiment with Synthesized Faces in Adobe Photoshop

Another study involved the popular Adobe Photoshop application, which was used to create a number of fake faces. Additionally, the stimuli were diversified with 50 real photos altered by a professional artist. As in the previous experiments, the results were not in favor of the human participants, who showed a performance rate of only 53.5%. The authors of the study later proposed a Dilated Residual Network (DRN), which achieved an Average Precision of 99.8%.
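
Average Precision (AP), the metric reported for the DRN, summarizes the precision-recall curve of a detector's confidence scores. A minimal sketch with scikit-learn and made-up scores (illustrative only, not the study's data):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy labels (1 = manipulated face) and detector confidence scores.
y_true = np.array([0, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.90, 0.20, 0.75])

# AP of 1.0 would mean every manipulated face outranks every real one.
print(f"AP = {average_precision_score(y_true, y_score):.3f}")
```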

Experiments Conducted at School of Psychology, University of Lincoln

The University of Lincoln conducted an experiment in which volunteers were tasked with telling real images apart from morphed ones. Prior to the test, some participants received brief training. The results were quite low, with the control group scoring a 56% performance rate and the trained group only 52.1%. Three more tests were conducted, also showing low human performance.


Experiments with Human Crowds, Machines, and Machine-informed Crowds

A study conducted at MIT showed somewhat more optimistic results for human performance. In Experiment 1, 82% of examinees managed to outperform the leading deepfake detection model; in Experiment 2, only 13-37% of participants achieved the same, indicating low repeatability of the results.



The study also indicated that for successful detection, a human observer needs to be informed about the context of the deepfake media (political videos are mentioned as one of the best contextual examples).

References

  1. American Airlines
  2. Facial recognition scanners are already at some US airports. Here's what to know
  3. Face blending technique used for making a fake ID
  4. Human or machine: AI proves best at spotting biometric attacks
  5. Future Trends in Digital Face Manipulation and Detection
  6. Interpol
  7. Deepfakes ranked as most serious AI crime threat
  8. Face morphing attacks: Investigating detection with humans and computers
  9. Demo of a face liveness detection process
  10. How can you tell if another person, animal or thing is conscious? Try these 3 tests
  11. A poor quality fake produced in Photoshop
  12. A C-LSTM Neural Network for Text Classification
  13. DeepFakes Detection Based on Heart Rate Estimation: Single- and Multi-frame
  14. Face Flashing: a Secure Liveness Detection Protocol based on Light Reflections
  15. A fake face generated with Thispersondoesntexist.com
  16. Real and fake faces mingled for the ID R&D test
  17. Synthetic Faces: how perceptually convincing are they?
  18. Real and GAN-synthesized faces used in the Berkeley test
  19. Detecting Photoshopped Faces by Scripting Photoshop
  20. Deepfake Detection by Human Crowds, Machines, and Machine-informed Crowds