Demographic Bias in Facial Recognition and Liveness

Experts grapple with the fact that training datasets for facial recognition may reflect the same demographic biases that exist in real life.

Definition & Overview

Although facial recognition and liveness detection, both passive and active, have steadily improved in accuracy, these systems still do not perform equally well across all racial and gender groups, and the issue continues to be raised in scientific publications. A number of experiments, including those conducted by Joy Buolamwini and Timnit Gebru in 2018 and 2019, revealed gender and racial unfairness and inaccuracy in facial recognition (see below).

Global distribution of skin color (Encyclopaedia Britannica)

This issue is largely attributed to poorly balanced data used for training face liveness and recognition solutions, as well as to biased modelling approaches. Some studies also raise concerns about deliberate discrimination in the field. At the same time, several notable face datasets, including FairFace, Racial Faces in the Wild, and DFDC, have been designed with racial balance in mind.

Joy Buolamwini, computer scientist and digital activist

Countermeasures suggested in response to the issue include a debiasing Presentation Attack Detection (PAD) algorithm based on a Multi-Task Cascaded Convolutional Neural Network (MTCNN), removal of demographic features, data resampling during the training phase, assembly of more racially proportionate datasets, and so on. Facial anti-spoofing, across all its types and countermeasures, is expected to be impartial with regard to racial diversity.

Some Notable Cases of Demographic Bias

Racial and gender bias has been observed since the first Face Recognition Vendor Test 2002 was issued by NIST. Concerns regarding a somewhat similar issue of own-race bias in facial recognition were voiced even earlier by Valentine, Chiroro and Dixon in 1995 — just two years after the launch of the FERET program.

In 2018 Joy Buolamwini built a special dataset of 1,270 images of politicians to test gender fairness. The dataset was run through three systems by Megvii, Microsoft, and IBM. The test results showed significant gender misidentification: 7% of lighter-skinned females, 12% of darker-skinned males, and 35% of darker-skinned females were misclassified.
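Subgroup figures like these come from disaggregating a classifier's errors by demographic group instead of reporting a single overall rate. The following is a minimal sketch of that bookkeeping in Python; the function name `error_rate_by_group` and the toy records are illustrative assumptions, not the study's code or data.

```python
from collections import defaultdict


def error_rate_by_group(records):
    """Return {group: share of misclassified samples}.

    Each record is a (group, true_label, predicted_label) tuple.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, true_label, predicted_label in records:
        totals[group] += 1
        if predicted_label != true_label:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Synthetic toy records (group, true gender, predicted gender); not real results.
toy_records = [
    ("lighter-skinned female", "F", "F"),
    ("lighter-skinned female", "F", "F"),
    ("darker-skinned female", "F", "M"),
    ("darker-skinned female", "F", "F"),
    ("darker-skinned male", "M", "M"),
    ("darker-skinned male", "M", "M"),
]

rates = error_rate_by_group(toy_records)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%}")
```

Reporting the per-group rates side by side is what makes a disparity visible; an aggregate error rate over the same records would hide it.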

Another experiment exploring racial and gender bias in biometric security took place in 2019. A study by MIT researcher Joy Buolamwini found that the Amazon Rekognition system had a 19% error rate when recognizing the gender of lighter-skinned people, while a higher error rate of 31% was observed for darker-skinned people.

American Civil Liberties Union claimed in 2018 that Rekognition made 28 false matches

In 2020 the same system was tested again by Comparitech. The authors compared headshots of American and British politicians with images from the Jailbase.com mugshot database. The system erroneously matched 32 US Congresspersons and misidentified 73 UK politicians. Considering that Amazon Rekognition is employed by law enforcement and by organizations such as Thorn.org, these results are rather unsatisfactory.

Mugshots retrieved from Jailbase.com

Methodology & Experiments

In response to this issue, several methodologies have been proposed to increase fairness in face recognition systems while preserving their integrability.

Individual fairness methodology

The core idea of this approach is to introduce the notion of individual fairness, which originates in the Fairness Through Awareness concept. It implies that similar individuals, for example those who resemble each other in biometric features, should be treated similarly. The main proposed solution is a novel fair group-based score normalization method, which focuses on individuals rather than on a particular demographic group.

The method employs vector quantization, namely a K-means clustering algorithm. The solution comprises:

A set of face embeddings: X = (Xtrain ∪ Xtest).
Corresponding identity information: y = (ytrain ∪ ytest).

The K-means model is trained on Xtrain, which splits the embedding space into K clusters. For each cluster, the genuine and impostor scores of its samples are collected and used to compute a cluster-specific threshold at a chosen false match rate; these thresholds are then used to estimate the normalized comparison score.

Genuine and imposter scores of cluster c used for computing an optimal threshold for a false match rate:

$\displaystyle{ gen_c=\{s_{ij} \mid \mathit{ID}(i)=\mathit{ID}(j),\ i \neq j,\ \forall i \in C_c\} }$

$\displaystyle{ imp_c=\{s_{ij} \mid \mathit{ID}(i)\neq \mathit{ID}(j),\ \forall i \in C_c\}. }$

Normalized score calculation with the cluster thresholds:

$\displaystyle{ \hat{s}_{ij}=s_{ij}-\frac{1}{2}\bigl(\Delta thr_{i} + \Delta thr_{j} \bigr) }$

Local-global threshold difference for sample i:

$\displaystyle{ \Delta thr_{i}=thr_{i}-thr_{G} }$
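The normalization step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: cluster assignments are taken as given (for example, from a K-means model fitted beforehand), each threshold is derived from impostor scores only at a target false match rate, the index j ranges over all samples, and the helper names (`impostor_scores`, `fmr_threshold`, `fair_normalize`) and toy data are hypothetical.

```python
def impostor_scores(scores, ids, members):
    """Collect s_ij with ID(i) != ID(j) for every i in `members`.

    Assumption: j ranges over all samples, since the text does not restrict it.
    """
    return [scores[i][j]
            for i in members
            for j in range(len(ids))
            if ids[i] != ids[j]]


def fmr_threshold(impostors, fmr):
    """Threshold with at most a share `fmr` of impostor scores strictly above it."""
    if not 0.0 <= fmr < 1.0:
        raise ValueError("fmr must be in [0, 1)")
    ranked = sorted(impostors, reverse=True)
    return ranked[int(fmr * len(ranked))]


def fair_normalize(scores, ids, clusters, fmr=0.1):
    """Return (normalized score matrix, per-sample threshold differences)."""
    n = len(ids)
    thr_global = fmr_threshold(impostor_scores(scores, ids, range(n)), fmr)
    thr_local = {}
    for c in set(clusters):
        members = [i for i in range(n) if clusters[i] == c]
        imp_c = impostor_scores(scores, ids, members)
        # Fall back to the global threshold if a cluster has no impostor pairs.
        thr_local[c] = fmr_threshold(imp_c, fmr) if imp_c else thr_global
    # Local-global threshold difference for each sample.
    delta = [thr_local[clusters[i]] - thr_global for i in range(n)]
    # Shift each score by the mean threshold difference of its two samples.
    normalized = [[scores[i][j] - 0.5 * (delta[i] + delta[j]) for j in range(n)]
                  for i in range(n)]
    return normalized, delta


# Toy symmetric similarity matrix for four samples of four different identities.
toy_scores = [
    [1.0, 0.6, 0.2, 0.1],
    [0.6, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.2],
    [0.1, 0.2, 0.2, 1.0],
]
toy_ids = [0, 1, 2, 3]
toy_clusters = [0, 0, 1, 1]  # assumed output of a prior clustering step

normalized, delta = fair_normalize(toy_scores, toy_ids, toy_clusters, fmr=0.1)
print(delta)
```

In the toy data the second cluster has a lower local threshold than the global one, so its threshold difference is negative and scores involving its samples are raised; a cluster whose local threshold exceeds the global one would have its scores lowered instead.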

The GRAD-GPAD Evaluation Framework

Generalization Representation over Aggregated Datasets for Generalized Presentation Attack Detection (GRAD-GPAD) was developed to introduce a scalable generalization approach for detecting facial Presentation Attacks (PAs). In simple terms, the framework helps a researcher discover new patterns, properties, and insights in PAD via a common taxonomy of existing datasets.

Schematic representation of the GRAD-GPAD framework

The framework is well suited to hosting experiments, as it offers reproducible research, a labelling approach for developing new evaluation protocols, new generalization and demographic bias metrics, graphics for comparing method performance, and other tools.

PAI distribution + categorization represented via GRAD-GPAD

The method includes two main phases: a) feature extraction, in which features are retrieved from the input data and preprocessed; and b) evaluation, in which filtering and a common dataset categorization are used for training and testing over the selected features.

Gender Bias

The GRAD-GPAD method holds that more balanced datasets should be assembled to mitigate gender bias without jeopardizing accuracy in either active or passive liveness detection. Additionally, dataset aggregation and resampling can compensate for the lack of representative data.

Gender distribution in the GRAD-GPAD datasets
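As an illustration of the resampling idea, the sketch below balances a training set by randomly duplicating samples from the smaller groups until every group matches the largest one. It is a generic random-oversampling example, not a procedure taken from GRAD-GPAD; the function name `oversample_to_balance` and the toy samples are assumptions.

```python
import random
from collections import Counter, defaultdict


def oversample_to_balance(samples, group_of, seed=0):
    """Duplicate samples of smaller groups until all groups have equal size."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for sample in samples:
        by_group[group_of(sample)].append(sample)
    if not by_group:
        return []
    target = max(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(members)
        # Draw the missing samples with replacement from the same group.
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced


# Toy dataset of (image name, gender label) pairs: 6 male, 2 female.
toy_samples = [("img%02d" % i, "male") for i in range(6)]
toy_samples += [("img%02d" % i, "female") for i in range(6, 8)]

balanced = oversample_to_balance(toy_samples, group_of=lambda s: s[1])
counts = Counter(group for _, group in balanced)
print(counts)
```

Duplication equalizes group sizes but adds no new information about the smaller group, which is why it is usually treated as a complement to aggregating additional datasets.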

The fair normalization method (individual fairness methodology) proposes attenuating demographic bias, including gender-based bias, with two normalizing steps:

Improving weak demographic groups that are underrepresented.
Adjusting strong demographic groups that are sufficiently represented or overrepresented.

This helps achieve a more stable balance between demographic classes as they will eventually match each other's performances.

Another method uses an Inclusive FaceNet model, which applies transfer learning to learn attribute prediction models for multiple gender subgroups.

Age Bias

The GRAD-GPAD approach suggests that all subjects featured in face anti-spoofing and recognition datasets should be separated into three classes:

Young. Age 18-25.
Adult. Age 25-65.
Senior. Age 65 and older.
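A labelling helper for these classes might look like the sketch below. Because the listed ranges overlap at 25 and 65, it assumes that a boundary age belongs to the older class, and that subjects under 18 fall outside the taxonomy; both choices, the function name `age_class`, and the toy ages are assumptions.

```python
from collections import Counter


def age_class(age):
    """Map an age in years to Young, Adult, or Senior."""
    if age < 18:
        return None  # assumption: under-18 subjects are outside the taxonomy
    if age < 25:
        return "Young"
    if age < 65:
        return "Adult"  # assumption: age 25 counts as Adult
    return "Senior"  # 65 and older


# Toy ages used only to show how a class distribution is tallied.
toy_ages = [19, 24, 25, 31, 40, 47, 58, 64, 65, 72]
distribution = Counter(age_class(age) for age in toy_ages)
print(distribution)
```

Tallying the classes this way for each dataset is one simple way to expose an uneven age distribution before deciding which datasets to aggregate.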

A closer examination reveals that the age distribution is uneven: the adult group is predominant, while the senior group is barely represented at all. Again, dataset aggregation is considered an optimal way to mitigate the bias. It can additionally benefit from datasets such as MORPH or FG-NET, which were built to study face recognition across ages.

Combined samples from MORPH and FG-NET datasets

Race Bias

As observed by the Antispoofing.org Wiki, the use of MTCNN is reported to diminish racial bias in facial recognition. It relies on joint learning, which allows the network to predict a subject's ethnicity, as well as gender and age, by analyzing the input data. Ultimately, the bias is reduced.

The joint learning principle implemented in a CNN model (Joint Learning Multi-Loss)
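The joint learning principle can be illustrated with a combined training objective. The sketch below assumes the joint loss is a weighted sum of per-task cross-entropy losses over ethnicity, gender, and age heads, which is a common multi-task formulation; the exact loss used with MTCNN is not given here, and the function names, class counts, and logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class index."""
    return -math.log(softmax(logits)[target])


def joint_loss(head_logits, targets, weights=None):
    """Weighted sum of the per-task cross-entropy losses."""
    weights = weights or {task: 1.0 for task in head_logits}
    return sum(weights[task] * cross_entropy(head_logits[task], targets[task])
               for task in head_logits)


# Hypothetical outputs of three classification heads for one face image.
toy_logits = {
    "ethnicity": [2.0, 0.5, 0.1, -1.0],
    "gender": [0.3, 1.2],
    "age": [0.1, 1.5, -0.4],
}
toy_targets = {"ethnicity": 0, "gender": 1, "age": 1}
print(joint_loss(toy_logits, toy_targets))
```

Because all heads share one backbone, minimizing the summed loss pushes the shared features to carry information about every attribute at once, which is the intuition behind the reported bias reduction.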

The GRAD-GPAD method outlines 6 skin-tone categories: Light Yellow, Medium Yellow, Brown, Light Pink, Medium Pink Brown, Medium Dark Brown. Their distribution across various datasets is rather uneven, so the method proposes the following solution of extended datasets:

GRAD-GPAD datasets with extended skin-tone distribution

Demographic Bias in Iris Recognition Systems

An experiment focusing on gender bias in iris spoofing detection used the NDCLD-2013 dataset and three solutions: Local Binary Pattern (LBP), MobileNetV3-Small, and VGG-16. It showed a disparity between genders: decisions for the female cohort were slightly worse than for the male cohort with the neural networks and much worse with LBP.

Ethical Aspects of Demographic Bias in Facial Liveness

Demographic bias, along with the at times unethical conduct of some companies working in the field, has provoked further controversy regarding facial recognition. For instance, Clearview AI is reported to have "unlawfully scraped from social media websites and applications" user photos while remaining unaccountable to any regulation. (This later resulted in heavy fines.)

As a possible remedy, the UK’s Biometrics and Forensics Ethics Group (BFEG) has established 9 key ethical principles, which should regulate facial recognition solutions. "Avoidance of bias" and "Impartiality" are mentioned among them.