Demographic Bias in Facial Recognition and Liveness

From Antispoofing Wiki

Demographic bias is a recurring source of controversy in biometric identification, and in facial recognition in particular.

Definition & Overview

Although facial recognition and liveness detection show steady improvement in accuracy, racial and gender bias is still regularly reported in scientific papers. A number of experiments, including those conducted by Joy Buolamwini and Timnit Gebru in 2018 and 2019, revealed gender and racial unfairness and inaccuracy in facial recognition (see below).

This issue is largely attributed to poorly balanced training data for face recognition solutions, as well as to biased modelling approaches. Some studies also raise concerns about deliberate discrimination in the field. At the same time, several notable face datasets, including FairFace, Racial Faces in the Wild, and DFDC, were designed with racial balance in mind.

Several countermeasures have been suggested in response. Among them are a debiasing Presentation Attack Detection (PAD) algorithm based on a Multi-Task Convolutional Neural Network (MTCNN), removal of demographic features, data resampling during the training phase, and assembling more racially proportionate datasets.

Some Notable Cases of Demographic Bias

Racial and gender bias has been observed since NIST published the Face Recognition Vendor Test (FRVT) 2002 report. Concerns about the related issue of own-race bias in face recognition were voiced even earlier, by Valentine, Chiroro and Dixon in 1995, just two years after the launch of the FERET program.

In 2018 Joy Buolamwini built a dedicated dataset of 1,270 images of politicians to test gender fairness. The dataset was run through three commercial systems, from Megvii, Microsoft, and IBM. The results showed significant gender misclassification, with error rates of up to 7% for lighter-skinned females, 12% for darker-skinned males, and 35% for darker-skinned females.

Another experiment on the issue took place in 2019. A study by MIT researcher Joy Buolamwini revealed that the Amazon Rekognition system misclassified the gender of women 19% of the time. For darker-skinned women the error rate was even higher: 31%.

In 2020 the same system was tested again by Comparitech. The authors compared headshots of American and British politicians against images from a mugshot database. The system erroneously matched 32 US Congresspersons and misidentified 73 UK politicians. Given that Amazon Rekognition is used by law enforcement and similar organizations, these results are unsatisfactory.

Methodology & Experiments

In response, several methodologies have been proposed to increase fairness in face recognition systems while keeping them easy to integrate into existing solutions.

Individual fairness methodology

The core idea of this approach is to adopt the notion of individual fairness, as presented in the Fairness Through Awareness concept: similar individuals, for example those who share similar biometric features, should be treated similarly. The main solution proposed is a novel fair group-based score normalization method, which focuses on individuals rather than on particular demographic groups.

The method employs vector quantization, namely a K-means clustering algorithm. The solution comprises:

  • A set of face embeddings: X = (Xtrain ∪ Xtest).
  • Corresponding identity information: y = (ytrain ∪ ytest).

K-means is trained on Xtrain, which splits the embedding space into K clusters. For each cluster, the genuine and imposter comparison scores are then computed, a threshold is derived for a given false match rate, and these cluster thresholds are used to normalize the comparison scores.
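The clustering step can be sketched as follows. This is a minimal pure-Python version of Lloyd's K-means, not the authors' implementation: the 2-D points stand in for face embeddings, the initialization from the first k points is a simplification, and the names `kmeans`, `nearest_cluster`, `x_train` and `x_test` are hypothetical.

```python
import math


def nearest_cluster(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (centroids, assignment per point)."""
    # Assumption: naive initialization from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [nearest_cluster(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return centroids, assignment


# Toy "embeddings" (made up for illustration): two well-separated groups.
x_train = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centroids, train_clusters = kmeans(x_train, k=2)

# Unseen samples are mapped to the cluster of the nearest trained centroid.
x_test = [(9.0, 9.0), (0.5, 0.5)]
test_clusters = [nearest_cluster(p, centroids) for p in x_test]
print(train_clusters, test_clusters)
```

In practice a library implementation with a better initialization would normally be preferred; the sketch only shows how fitting on the training embeddings partitions the space and how new samples are assigned to a cluster.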

The genuine and imposter scores of cluster c, which are used to compute an optimal threshold for a given false match rate, are defined as:

[math]\displaystyle{ gen_c=\{s_{i,j} \mid \mathit{ID}(i)=\mathit{ID}(j),\ i \neq j,\ \forall i \in C_c\} }[/math]

[math]\displaystyle{ imp_c=\{s_{i,j} \mid \mathit{ID}(i)\neq \mathit{ID}(j),\ \forall i \in C_c\}. }[/math]
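A minimal sketch of how these sets and a cluster threshold could be computed is shown below. It follows the definitions as written, so only sample i is required to lie in cluster c, and each unordered pair is counted once. The function names, the toy score matrix, and the convention that a match means `score > threshold` are assumptions made for illustration.

```python
def cluster_scores(scores, identities, clusters, c):
    """Split the comparison scores of cluster c into genuine and imposter lists.

    scores:     symmetric matrix, scores[i][j] is the similarity of samples i and j
    identities: identities[i] is the identity label of sample i
    clusters:   clusters[i] is the cluster index of sample i
    """
    gen, imp = [], []
    seen = set()  # unordered pairs already counted
    for i, cluster_i in enumerate(clusters):
        if cluster_i != c:
            continue
        for j in range(len(clusters)):
            pair = (min(i, j), max(i, j))
            if i == j or pair in seen:
                continue
            seen.add(pair)
            target = gen if identities[i] == identities[j] else imp
            target.append(scores[i][j])
    return gen, imp


def threshold_at_fmr(imposter_scores, fmr):
    """Threshold whose false match rate does not exceed `fmr`.

    Assumed convention: a match is declared when score > threshold.
    Requires a non-empty list of imposter scores.
    """
    ranked = sorted(imposter_scores, reverse=True)
    allowed = min(int(fmr * len(ranked)), len(ranked) - 1)  # imposters allowed to match
    return ranked[allowed]


# Toy data (made up): four samples, two identities, two clusters.
identities = ["a", "a", "b", "b"]
clusters = [0, 0, 0, 1]
scores = [
    [1.0, 0.9, 0.3, 0.2],
    [0.9, 1.0, 0.4, 0.1],
    [0.3, 0.4, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]

gen_0, imp_0 = cluster_scores(scores, identities, clusters, c=0)
thr_0 = threshold_at_fmr(imp_0, fmr=0.25)
print(gen_0, imp_0, thr_0)
```

With four imposter scores and a target false match rate of 0.25, exactly one imposter pair is allowed to exceed the threshold.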

Normalized score calculation with the cluster thresholds:

[math]\displaystyle{ \hat{s}_{i,j}=s_{i,j}-\frac{1}{2}\bigl(\Delta thr_{i} + \Delta thr_{j} \bigr) }[/math]

Local-global threshold difference for sample i, where thr_i is the threshold of the cluster that contains sample i and thr_G is the global threshold:

[math]\displaystyle{ \Delta thr_{i}=thr_{i}-thr_{G} }[/math]
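The two formulas above translate directly into code. The sketch below is illustrative only: the function name `normalize_score`, the cluster thresholds, and the global threshold are made-up values, not figures from the original study.

```python
def normalize_score(score, thr_i, thr_j, thr_global):
    """Fair score normalization for one comparison.

    thr_i, thr_j: thresholds of the clusters that contain samples i and j
    thr_global:   threshold computed over all samples
    """
    delta_i = thr_i - thr_global  # local-global difference for sample i
    delta_j = thr_j - thr_global  # local-global difference for sample j
    return score - 0.5 * (delta_i + delta_j)


# Hypothetical thresholds: cluster 0 is stricter than the global one, cluster 1 looser.
cluster_thresholds = {0: 0.5, 1: 0.25}
thr_global = 0.375

# Both samples in the strict cluster: the score is lowered.
same_cluster = normalize_score(0.75, cluster_thresholds[0], cluster_thresholds[0], thr_global)

# One sample in each cluster: the two corrections cancel out.
mixed = normalize_score(0.75, cluster_thresholds[0], cluster_thresholds[1], thr_global)
print(same_cluster, mixed)
```

The effect is that a single global decision threshold can be kept while scores from clusters with unusually high or low local thresholds are shifted toward it.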

The GRAD-GPAD Evaluation Framework

Generalization Representation over Aggregated Datasets for Generalized Presentation Attack Detection (GRAD-GPAD) was developed to introduce a scalable generalization approach for detecting facial Presentation Attacks (PAs). In simple terms, the framework helps a researcher discover new patterns, properties, and insights about PAD through a common taxonomy of existing datasets.

The framework is well suited to running experiments, as it offers reproducible research, a labelling approach for developing new evaluation protocols, new generalization and demographic bias metrics, graphics for comparing method performance, and other tools.

The method includes two main phases: (a) feature extraction, in which features are retrieved from the input data and preprocessed, and (b) evaluation, in which filtering and a common dataset categorization are used to train and test over the selected features.

Gender Bias

The GRAD-GPAD method suggests that more balanced datasets should be assembled to mitigate gender bias. In addition, data aggregation and resampling can compensate for a lack of representative data.
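As an illustration of resampling, the sketch below balances groups by random oversampling, which duplicates samples of smaller groups until every group matches the largest one. This is one common resampling strategy chosen for the example; the source does not specify which scheme is used, and the function name and toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample_to_balance(samples, group_of, seed=0):
    """Randomly duplicate samples of smaller groups until all groups are equal in size."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    by_group = defaultdict(list)
    for sample in samples:
        by_group[group_of(sample)].append(sample)
    target = max(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(members)
        # Draw the missing samples with replacement from the same group.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Toy dataset (made up): (image name, gender label), six male and two female samples.
dataset = [(f"m{i}.png", "male") for i in range(6)] + [(f"f{i}.png", "female") for i in range(2)]
balanced = oversample_to_balance(dataset, group_of=lambda sample: sample[1])
counts = Counter(label for _, label in balanced)
print(counts)
```

Oversampling adds no new information about the underrepresented group, which is why aggregating additional datasets remains the preferred remedy when more data is available.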

The fair normalization method (individual fairness methodology) suggests attenuating demographic bias, including gender-based bias, with two normalizing steps:

  • Improving weak demographic groups that are underrepresented
  • Adjusting strong demographic groups that are adequately represented or overrepresented.

This helps achieve a more stable balance between demographic classes, as their performance levels eventually converge.

Another method relies on the InclusiveFaceNet model, which uses transfer learning to train attribute prediction models for multiple gender subgroups.

Age Bias

The GRAD-GPAD approach suggests that all subjects featured in the facial datasets should be separated into three classes:

  • Young. Age 18-25.
  • Adult. Age 25-65.
  • Senior. Age 65 and older.

A closer examination reveals that the age distribution is uneven: the adult group is predominant, while the senior group is barely represented at all. Again, dataset aggregation is considered an optimal way of mitigating the bias. It can additionally benefit from datasets, such as MORPH or FG-NET, that study the effects of aging on facial recognition.
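Checking how a dataset is distributed over these classes can be sketched as below. Because the listed ranges overlap at 25 and 65, the sketch assumes that boundary ages fall into the older class; that choice, the function name, the rejection of ages under 18, and the sample ages are assumptions made for illustration.

```python
from collections import Counter


def age_class(age):
    """Map an age in years to one of the three classes.

    Assumption: ages 25 and 65 are assigned to the older class,
    since the listed ranges overlap at those values.
    """
    if age < 18:
        raise ValueError("ages below 18 are outside the listed classes")
    if age < 25:
        return "Young"
    if age < 65:
        return "Adult"
    return "Senior"


# Made-up subject ages, only to show the computation.
ages = [19, 25, 30, 41, 52, 64, 65, 23]
distribution = Counter(age_class(age) for age in ages)
shares = {label: count / len(ages) for label, count in distribution.items()}
print(distribution, shares)
```

Running the same count over each source dataset, and again over the aggregated data, shows whether aggregation actually improves the share of the senior class.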

Race Bias

The use of an MTCNN is reported to diminish racial bias in facial recognition. It relies on joint learning, in which a single network predicts the ethnicity, gender, and age of a subject from the input data. Learning these tasks together is reported to reduce the bias.

The GRAD-GPAD method outlines 6 skin-tone categories: Light Yellow, Medium Yellow, Brown, Light Pink, Medium Pink Brown, Medium Dark Brown. Their distribution across various datasets is rather uneven, so the method proposes extending the datasets as a solution.

Demographic Bias in Iris Recognition Systems

One experiment focused on gender bias in iris PAD. It used the NDCLD-2013 dataset and three solutions: Local Binary Pattern (LBP), MobileNetV3-Small, and VGG-16. It revealed a disparity between genders: decisions for the female cohort were slightly less accurate than for the male cohort with the neural networks, and much less accurate with LBP.
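A disparity of this kind is typically quantified by computing PAD error rates separately for each group. The sketch below uses the two standard PAD error rates: APCER (the share of attack presentations accepted as bona fide) and BPCER (the share of bona fide presentations rejected as attacks). The records are invented toy data, not results from the experiment, and the function names are hypothetical.

```python
def rate(errors, total):
    """Error rate, or None when the group has no samples of that kind."""
    return errors / total if total else None


def pad_error_rates(records):
    """Per-group APCER and BPCER.

    records: iterable of (group, is_attack, predicted_attack) tuples.
    """
    counts = {}
    for group, is_attack, predicted_attack in records:
        c = counts.setdefault(group, {"attacks": 0, "missed": 0, "bona_fide": 0, "rejected": 0})
        if is_attack:
            c["attacks"] += 1
            if not predicted_attack:
                c["missed"] += 1  # attack accepted as bona fide
        else:
            c["bona_fide"] += 1
            if predicted_attack:
                c["rejected"] += 1  # bona fide rejected as attack
    return {
        group: {
            "APCER": rate(c["missed"], c["attacks"]),
            "BPCER": rate(c["rejected"], c["bona_fide"]),
        }
        for group, c in counts.items()
    }


# Toy decisions (made up): (group, is_attack, predicted_attack).
records = (
    [("female", True, True)] * 3 + [("female", True, False)] * 1
    + [("female", False, False)] * 2 + [("female", False, True)] * 2
    + [("male", True, True)] * 4
    + [("male", False, False)] * 3 + [("male", False, True)] * 1
)
rates = pad_error_rates(records)
bpcer_gap = rates["female"]["BPCER"] - rates["male"]["BPCER"]
print(rates, bpcer_gap)
```

The gap between the groups' error rates at a fixed decision threshold gives a simple single-number summary of the bias.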

Ethical Aspects of Demographic Bias in Facial Liveness

Demographic bias, together with the at times unethical conduct of some companies working in the field, has provoked further controversy around facial recognition. For instance, Clearview AI is reported to have "unlawfully scraped from social media websites and applications" user photos while remaining unaccountable to any regulation. (This later resulted in heavy fines.)

As a possible remedy, the UK’s Biometrics and Forensics Ethics Group (BFEG) has established nine key ethical principles that should govern facial recognition solutions. "Avoidance of bias" and "Impartiality" are among them.

References


  1. Understanding bias in facial recognition technologies
  2. FairFace
  3. Racial Faces in the Wild
  4. DFDC (Deepfake Detection Challenge)
  5. Joy Buolamwini, computer scientist and digital activist
  6. A FairFace dataset sample
  7. Face Recognition Vendor Test 2002: Evaluation Report
  8. An account of the own-race bias and the contact hypothesis based on a ‘face space’ model of face recognition
  9. Facial recognition software is biased towards white men, researcher finds
  10. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification
  11. Facial Recognition Is Accurate, if You’re a White Guy
  12. Gender and racial bias found in Amazon’s facial recognition technology (again)
  13. Amazon's Rekognition software lets cops track faces: Here's what you need to know
  14. Study finds Amazon’s face recognition incorrectly matches 105 US and UK politicians
  15. How To Find Mugshots Online
  17. What is
  18. Fairness Through Awareness
  19. Post-comparison mitigation of demographic bias in face recognition using fair score normalization
  20. Face presentation attack detection. A comprehensive evaluation of the generalisation problem
  21. The GRAD-GPAD framework
  22. InclusiveFaceNet: Improving Face Attribute Detection with Race and Gender Diversity
  23. MORPH Facial Recognition Database
  24. FG-NET
  25. Combined samples from MORPH and FG-NET datasets
  26. Mitigating Bias in Gender, Age, and Ethnicity Classification: a Multi-Task Convolution Neural Network Approach
  27. The joint learning principle implemented in a CNN model (Joint Learning Multi-Loss)
  28. NDCLD-2013
  29. Demographic Bias in Presentation Attack Detection of Iris Recognition Systems
  30. ICO fines facial recognition database company Clearview AI Inc more than £7.5m and orders UK data to be deleted
  31. UK’s Biometrics and Forensics Ethics Group