# Demographic Bias in Facial Recognition and Liveness

Demographic bias is a recurring source of controversy in biometric identification, and in facial recognition in particular.

## Definition & Overview

Although facial recognition and liveness detection show steady improvements in accuracy, racial and gender bias is still regularly reported in the scientific literature. A number of experiments, including the studies led by Joy Buolamwini and Timnit Gebru in 2018 and the follow-up audit in 2019, revealed gender and racial unfairness and inaccuracy in facial recognition (see below).

This issue is largely attributed to the poorly balanced data used for training face recognition solutions, as well as to biased modelling approaches. Some studies even raise concerns about deliberate discrimination in the field. At the same time, a number of notable face datasets (FairFace, Racial Faces in the Wild, DFDC) were designed with racial balance in mind.

Several countermeasures have been suggested in response. Among them are a debiasing Presentation Attack Detection (PAD) algorithm based on a Multi-Task Cascaded Convolutional Neural Network (MTCNN), removal of demographic features, data resampling during the training phase, and assembling more racially proportionate datasets.

## Some Notable Cases of Demographic Bias

Racial and gender bias has been observed since NIST issued the Face Recognition Vendor Test 2002. Concerns about the related issue of own-race bias in facial recognition had been voiced even earlier, by Valentine, Chiroro and Dixon in 1995, just two years after the launch of the FERET program.

In 2018 Joy Buolamwini assembled a special dataset of 1,270 images of politicians to test gender classification fairness. The dataset was run through three systems, by Megvii, Microsoft, and IBM. The results showed significant gender misclassification rates: 7% for lighter-skinned females, 12% for darker-skinned males, and 35% for darker-skinned females.

Another experiment took place in 2019, when a follow-up study by MIT researcher Joy Buolamwini revealed that Amazon's Rekognition system misclassified the gender of women 19% of the time. For darker-skinned women the error rate was even higher: 31%.

In 2020 the same system was tested again, by Comparitech. The authors compared headshots of American and British politicians with images from the Jailbase.com mugshot database. The system erroneously matched 32 US Congress members and misidentified 73 UK politicians. Considering that Amazon Rekognition is employed by law enforcement and organizations such as Thorn.org, these results are rather unsatisfactory.

## Methodology & Experiments

In response, several methodologies have been proposed to increase fairness in face recognition systems while keeping them practical to integrate into existing pipelines.

### Individual fairness methodology

The core idea of this approach is the notion of individual fairness, introduced in the Fairness Through Awareness framework: similar individuals, for example those who share similar biometric features, should be treated similarly. As the main solution, a novel fair group-based score normalization method is proposed, which focuses on individual characteristics rather than membership in a demographic group.

The method employs vector quantization, namely the K-means clustering algorithm. The solution requires:

• A set of face embeddings: X = (Xtrain ∪ Xtest).
• Corresponding identity information: y = (ytrain ∪ ytest).

The K-means model is trained on Xtrain, which splits the embedding space into K clusters. For each cluster, the genuine and impostor comparison scores of samples belonging to it are collected; these are used to compute a cluster-specific false match rate threshold and, from it, the normalized comparison score:

$\displaystyle{ gen_c = \{ s_{i,j} \mid ID(i) = ID(j),\ i \neq j,\ \forall i \in C_c \} }$

$\displaystyle{ imp_c = \{ s_{i,j} \mid ID(i) \neq ID(j),\ \forall i \in C_c \} }$

$\displaystyle{ \hat{s}_{i,j} = s_{i,j} - \frac{1}{2} \bigl( \Delta thr_i + \Delta thr_j \bigr) }$

$\displaystyle{ \Delta thr_i = thr_i - thr_G }$
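As a minimal sketch of this normalization, assuming scikit-learn's KMeans and entirely illustrative per-cluster thresholds (thr_c) and global threshold (thr_G), the score shift could look like:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy face embeddings standing in for Xtrain (names and data are illustrative).
X_train = rng.normal(size=(200, 16))

# Partition the embedding space into K clusters.
K = 4
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X_train)

# Placeholder per-cluster decision thresholds thr_c (in practice each would be
# derived from the cluster's impostor scores at a target false match rate)
# and a placeholder global threshold thr_G computed over all scores.
thr_c = np.array([0.52, 0.48, 0.55, 0.50])
thr_G = 0.51

def normalized_score(s_ij, emb_i, emb_j):
    """Shift a raw comparison score s_ij by the mean threshold offset of the
    clusters that samples i and j fall into:
    s_hat = s_ij - 0.5 * (dthr_i + dthr_j), where dthr = thr_c - thr_G."""
    c_i = kmeans.predict(emb_i.reshape(1, -1))[0]
    c_j = kmeans.predict(emb_j.reshape(1, -1))[0]
    d_i = thr_c[c_i] - thr_G
    d_j = thr_c[c_j] - thr_G
    return s_ij - 0.5 * (d_i + d_j)

s_hat = normalized_score(0.60, X_train[0], X_train[1])
```

Because the shift depends only on the clusters the two samples fall into, the correction is applied per individual comparison rather than per labelled demographic group.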

GRAD-GPAD (Generalization Representation over Aggregated Datasets for Generalized Presentation Attack Detection) has been developed to provide a scalable generalization approach for detecting facial Presentation Attacks (PAs). In simple terms, the framework helps researchers discover new patterns and properties of PAD through a common taxonomy over existing datasets.

The framework is well suited to hosting experiments, as it offers reproducible research, a labelling approach for developing new evaluation protocols, new generalization and demographic bias metrics, method performance comparison graphics, and other tools.

The method includes two main phases:

• Feature extraction, in which features are retrieved from the input data and preprocessed.
• Evaluation, in which filtering and the common dataset categorization are used for training and testing on the selected features.
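The two phases can be sketched roughly as follows; every name here is a hypothetical stand-in rather than actual GRAD-GPAD code, with a trivial normalizer in place of a real feature extractor:

```python
import numpy as np

def extract_features(frames):
    """Phase (a): preprocess raw input and retrieve a feature vector per sample.
    Here: flatten and per-sample standardize toy 'frames' as a placeholder."""
    flat = frames.reshape(len(frames), -1).astype(float)
    return (flat - flat.mean(axis=1, keepdims=True)) / (flat.std(axis=1, keepdims=True) + 1e-8)

def evaluate(features, labels, categories, protocol):
    """Phase (b): filter samples by a common taxonomy label (e.g. attack type)
    before training/testing on the selected features."""
    mask = np.isin(categories, protocol["keep_categories"])
    return features[mask], labels[mask]

frames = np.zeros((6, 4, 4))                       # toy input "video frames"
labels = np.array([0, 1, 0, 1, 0, 1])              # bona fide / attack labels
categories = np.array(["print", "replay", "print", "mask", "print", "replay"])

feats = extract_features(frames)
sel_feats, sel_labels = evaluate(feats, labels, categories,
                                 {"keep_categories": ["print", "replay"]})
```

The point of the split is that the same extracted features can be reused under many evaluation protocols simply by changing the filtering step.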

### Gender Bias

The GRAD-GPAD method dictates that more balanced datasets should be assembled to mitigate gender bias. In addition, dataset aggregation and resampling can compensate for the lack of representative data.
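A simple form of such resampling is to oversample the underrepresented group during training; a minimal sketch with entirely synthetic data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced training set: 90 samples labelled "male", 10 labelled "female"
# (all values illustrative).
X = rng.normal(size=(100, 8))
groups = np.array(["male"] * 90 + ["female"] * 10)

def resample_balanced(X, groups, rng):
    """Oversample each demographic group (with replacement) up to the size
    of the largest group, so the training distribution becomes balanced."""
    uniq, counts = np.unique(groups, return_counts=True)
    target = counts.max()
    idx = []
    for g in uniq:
        g_idx = np.flatnonzero(groups == g)
        idx.append(rng.choice(g_idx, size=target, replace=True))
    idx = np.concatenate(idx)
    return X[idx], groups[idx]

X_bal, groups_bal = resample_balanced(X, groups, rng)
# Each group now contributes the same number of samples.
```

Oversampling duplicates minority-group samples rather than discarding majority-group ones, which keeps the full dataset available to the model.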

The fair normalization method (individual fairness methodology) suggests attenuating demographic bias, including gender bias, with two normalizing steps:

• Boosting weak demographic groups that are underrepresented.
• Adjusting strong demographic groups that are sufficiently represented or overrepresented.

This helps achieve a more stable balance between demographic classes, as their performance levels eventually converge.

Another method features the InclusiveFaceNet model, which uses transfer learning to train attribute prediction models for multiple gender subgroups.

### Age Bias

The GRAD-GPAD approach suggests that all subjects featured in the facial datasets should be separated into three classes:

• Young. Age 18-25.
• Adult. Age 26-64.
• Senior. Age 65 and older.

A closer examination reveals that the age distribution is uneven, with the adult group predominant and the senior group barely represented at all. Again, dataset aggregation is considered the optimal way of mitigating the bias. It can additionally benefit from datasets, such as MORPH or FG-NET, that study the effects of aging on facial recognition.
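Checking a dataset's age balance against these classes is straightforward; in this sketch the middle "adult" band is assumed to cover ages 26-64 (implied by the other two bounds), and the ages themselves are made up:

```python
import numpy as np

# Toy subject ages (illustrative).
ages = np.array([19, 23, 34, 41, 52, 58, 60, 67, 70, 45, 30, 22])

# The three age classes; the adult bounds are an assumption.
bins = {"young": (18, 25), "adult": (26, 64), "senior": (65, 120)}

def age_distribution(ages, bins):
    """Count how many subjects fall into each age class, exposing imbalance
    such as adults dominating while seniors are scarce."""
    return {name: int(((ages >= lo) & (ages <= hi)).sum())
            for name, (lo, hi) in bins.items()}

dist = age_distribution(ages, bins)
```

Running such a check per source dataset before aggregation shows which datasets actually contribute senior subjects.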

### Race Bias

The use of MTCNN is reported to diminish racial bias in facial recognition. The network relies on joint learning: it predicts the ethnicity of a subject, along with gender and age, from the input data, which ultimately reduces the bias.

The GRAD-GPAD method outlines 6 skin-tone categories: Light Yellow, Medium Yellow, Brown, Light Pink, Medium Pink Brown, Medium Dark Brown. Their distribution across various datasets is rather uneven, so here too the method proposes aggregating extended datasets as a solution.

## Demographic Bias in Iris Recognition Systems

An experiment focusing on gender bias in iris PAD used the NDCLD-2013 dataset and three solutions: Local Binary Patterns (LBP), MobileNetV3-Small, and VGG-16. The experiment revealed a disparity between genders: decisions for the female cohort performed slightly worse than for the male cohort with the neural networks, and much worse with LBP.
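The disparity in such experiments is typically quantified by computing error rates separately per cohort; a minimal illustration with made-up decisions (not the experiment's actual data):

```python
import numpy as np

# Toy PAD ground truth (0 = bona fide, 1 = attack), predictions, and gender labels.
y_true = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
y_pred = np.array([0, 1, 0, 1, 1, 1, 0, 0, 1, 1])
gender = np.array(["m", "m", "m", "m", "m", "f", "f", "f", "f", "f"])

def error_rate_by_group(y_true, y_pred, groups):
    """Fraction of misclassified samples within each demographic group."""
    return {g: float((y_true[groups == g] != y_pred[groups == g]).mean())
            for g in np.unique(groups)}

rates = error_rate_by_group(y_true, y_pred, gender)
```

A gap between the per-group rates (here the female cohort errs twice as often as the male one) is exactly the kind of disparity the iris PAD study reports.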

## Ethical Aspects of Demographic Bias in Facial Liveness

Demographic bias, together with the at times unethical conduct of some companies working in the field, has provoked further controversy around facial recognition. For instance, Clearview AI is reported to have "unlawfully scraped from social media websites and applications" user photos while staying unaccountable to any regulation, conduct that later resulted in heavy fines.

As a possible remedy, the UK's Biometrics and Forensics Ethics Group (BFEG) has established 9 key ethical principles that should regulate facial recognition solutions; "Avoidance of bias" and "Impartiality" are among them.