EfficientNet for Deepfake Detection: Architecture, Purpose and Practical Application
EfficientNet is a method that scales up convolutional neural networks with the help of compound coefficient and increases their performance.
EfficientNet: Brief Overview
Convolutional neural networks (CNNs) were first introduced in the 1980s. One of the first known examples of CNN was the Time Delay Neural Network (TDNN) developed in 1987 by Alex Waibel and his research team. TDNN was focused on speech recognition, including shift-invariant phoneme recognition. The first work on modern convolutional neural networks (CNNs) occurred in the 1990s. Yann LeCun et al., in their paper “Gradient-Based Learning Applied to Document Recognition”
CNN architecture has since kept on evolving. EfficientNet was proposed in 2019 in a study, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. While being a novel method, EfficientNet also suggested a revolutionary technique of boosting performance of CNNs.
Typically, a steadfast resource cost is invested when building a CNN. Afterwards, it is possible to scale up such a neural network to achieve a higher performance accuracy if additional resources are used. This may include increasing the number of layers used in a CNN, and so on.
However, the traditional model scaling may prove ineffective. Increasing even a single dimension of a CNN enhances the accuracy of a neural network. At the same time, traditional methods require a lot of manual fine-tuning, while sometimes providing trifling results at best.
EfficientNet aspires to solve this issue. Its architecture implies that all dimensions of a CNN — like width, depth, and resolution — will be scaled up evenly. This is possible thanks to the compound coefficient, which guarantees a more structured approach.
With a fixed set of scaling coefficients, EfficientNet is capable of achieving impressive results. It's reported efficiency can be as high as 10x better compared to other methods.
As reported, EfficientNet-B7 works 6.1x faster, while being 8.4x smaller than other existing CNNs. Its ImageNet accuracy rate is estimated as state-of-the-art 84.3%. EfficientNet’s CIFAR-100 accuracy is reported to be 91.7%.
EfficientNet Architecture: Specifications & Technicalities
In essence, EfficientNet’s architecture is not too complicated. Baseline network Efficientnet-B0 plays a role of a groundwork or stem for further scaling. Authors highlight that they used a Neural Architecture Search (NAS) to construct it, which is a reinforcement learning-based method.
With its help, the multi-objective search was leveraged — this allows optimized accuracy of the network, as well as FLOPS.
EfficientNet includes 8 models: from B0 to B7. Each of them has 7 blocks that are reinforced with a varying number of sub-blocks. As the system upgrades from B0 to B7, the sub-block quantity grows.
Moreover, EfficientNet-B0 employs an inverted bottleneck with Depthwise Convolution operation. This bottleneck design implies usage of 1x1 convolution, which brings down the channel number. It was first introduced as part of the CNN dubbed InceptionV2.
Then, 3x3 and 5x5 convolution operations are applied to receive output features from the decreased number of channels. Lastly, a 1x1 convolution procedure is applied again to "dial" the channel quantity back to the initial number.
However, in the case of EfficientNet, this procedure is reversed. Bottleneck inversion does not reduce the channel number. Instead, their number is tripled by the first 1x1 convolutional layer.
Next, EfficientNet employs 5 modules that serve as a groundwork for building layers of the neural network.
- Module 1. This forms the basis for the sub-blocks.
- Module 2. A foundation for the first sub-block on which 7 main blocks are based (except for the 0 block).
- Module 3. It is connected to the network’s sub-blocks as skip connection.
- Module 4. Employed for combining the skip connection in the first sub-blocks.
- Module 5. It combines sub-blocks as a whole through skip connection, while every sub-block is connected to the preceding sub-block.
Eventually, all modules are incorporated to form 3 sub-blocks for specific purposes.
- Sub-block #1. It’s used as the first sub-block in the first block.
- Sub-block #2. Serves as the first sub-block in all other blocks.
- Sub-block #3. Used for all sub-blocks except every first sub-block in all blocks.
Naturally, EfficientNet architecture becomes more elaborate with each model as the quantity of sub-blocks increases. Here is the comparison between EfficeintNet-B0 and EfficeintNet-B7:
Deepfake Detection with EfficientNet: Experiments
A few experiments were held to assess practical performance of the EfficientNet architecture.
One of the researches involved a You Look Only Once solution abbreviated as YOLO-face detector and based on the EfficientNet architecture. It included a number of experiments. An evaluation metric Area Under the Receiver Operating Characteristic (AUROC) was used to appraise performance.
For instance, one of them saw EfficientNet and MTCNN solutions competing against each other. The former showed better results as it had fewer false positives. When applied to the CelebDF-FaceForencies++ data set, it demonstrated an 89.35% AUROC rate for pasting approach. MTCNN showed a worse performance: 83.39% AUROC rate for the same approach.
Another experiment from this research compared EfficientNet-B5 with the XceptionNet, which has a depthwise separable convolution. The competing neural networks were pre-trained with the ImageNet weights.
Such additional layers as Max pool 2D, ReLU activation function, dropout layer with a 0.2 probability were used during experiment. As a result, EfficientNet-B5 outperformed XceptionNet showing a higher AUROC rate: 8-10%.
Results show that while EfficientNet is 8.4 times smaller, it provides an 84.3% accuracy with 66M parameters and 37 billion FLOPS. Among all else, EfficientNet-B3 beats ResNeXt-101 in turn of resource consumption as it requires much fewer FLOPS: 18x times.
Another test involved EfficientNet-B5, DeepFake Detection Challenge Dataset (DFDC), and also additive angular margin loss dubbed ArcFace. (Serving as an alternative to the softmax+cross-entropy loss).
While offering their own solution, which includes an ensemble of Gated Recurrent Unit (GRU), MTCNN and EfficientNet, authors also came to a conclusion that EfficientNet-B5 demonstrates highly accurate results on its own.
In 2021 a competition dubbed 3D High-Fidelity Mask Face Presentation Attack Detection Challenge was hosted. Its purpose was to tackle the threat of 3D masks during the presentation attacks (PAs). The contest employed a special High-Fidelity Mask dataset (HiFiMask).
Various solutions were presented including EfficientNet. Their goal was to recognize masks that had different "skin tones", were made of various materials and presented in different lighting conditions.
Apart from other techniques, this solution employs Dual Shot Face Detector (DSFD), EfficientNet-B0 trained with binary cross-entropy loss, 6 descriptors of length 320 to guarantee better liveness detection, and other means.
The study shows that this EfficientNet-based solution demonstrates impressive results when detecting separate face regions: nose, ears, mouth. For eyes, APCER rate is 3.55% and BPCER is 2.54%. Overall, this method appears promising at tackling PAs featuring realistic 3D masks.
- Alex Waibel on Carnegie Mellon University's site
- Alex Weibel on Wikipedia
- EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
- A Comprehensible Explanation of the Dimensions in CNNs
- EfficientNet: Improving Accuracy and Efficiency through AutoML and Model Scaling
- Complete Architectural Details of all EfficientNet Models
- FLOPS (Wikipedia)
- InceptionV2 on paperswithcode.com
- How do bottleneck architectures work in neural networks?
- Deepfake video detection: YOLO-Face convolution recurrent approach
- Measuring Performance: AUC (AUROC)
- Stanford Cars
- FGVC Aircraft
- Deepfakes Detection with Automatic Face Weighting
- Deepfake image examples used for testing ANNs
- DeepFake Detection Challenge Dataset
- 3D High-Fidelity Mask Face Presentation Attack Detection Challenge
- Review of Face Presentation Attack Detection Competitions
- Dual Shot Face Detector
- 3D mask presentation attack detection via high resolution face parts