
The Top 3 Knowledge Distillation Attacks and Defenses Against Them

Distillation attacks allow malicious actors to replicate the behavior of a neural model via knowledge distillation and use the resulting copy for a variety of nefarious purposes.

What Is Knowledge Distillation?

Knowledge distillation algorithm

Knowledge distillation (KD) is the process of transferring knowledge from a larger neural model to a smaller one. A larger, more capable model is used to “teach” a simpler student neural network. This is done to cut computational and deployment costs whenever a model needs to be adapted for a large number of users. Originally dubbed “model compression,” the method was first introduced by Cornell University researcher Rich Caruana and colleagues.
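For readers who prefer code, below is a minimal PyTorch-style sketch of the standard distillation objective: the student is trained against the teacher's temperature-softened probabilities in addition to the ordinary labels. The temperature T and mixing weight alpha are illustrative hyperparameters, not values prescribed by this article.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend of hard-label cross-entropy and temperature-scaled KL to the teacher.

    T and alpha are illustrative; real values are task-dependent.
    """
    # Soft targets: the teacher's probabilities at temperature T
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    # Student log-probabilities at the same temperature
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL term scaled by T^2 so its gradients stay comparable to the CE term
    kd_term = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    # Ordinary cross-entropy on the ground-truth hard labels
    ce_term = F.cross_entropy(student_logits, labels)
    # alpha balances imitating the teacher against fitting the labels
    return alpha * kd_term + (1.0 - alpha) * ce_term
```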

Three pivotal factors contribute to the success of knowledge distillation: alignment of the data distribution (data geometry), which accelerates training; optimization bias, which minimizes the transfer risk; and strong monotonicity, which ensures that the student approximates the teacher's weight vector ever more closely as training data accumulates.
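To make the notion of transfer risk concrete, the snippet below sketches its usual definition in the linear-distillation setting referenced by the figure that follows: the probability that student and teacher disagree on a fresh input. The symbols w_s, w_t and D (student weights, teacher weights, input distribution) are our own notation, introduced here only for illustration.

```latex
% Transfer risk of a linear student w_s with respect to a linear teacher w_t:
% the probability, over inputs drawn from the data distribution D, that the
% two models assign different labels.
R_{\mathrm{tr}}(w_s) \;=\; \Pr_{x \sim \mathcal{D}}\!\left[\operatorname{sign}(w_s^{\top} x) \neq \operatorname{sign}(w_t^{\top} x)\right]
```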

An example of transfer risk in linear distillation

Examples of Distillation Attacks

There are three main types of KD attack scenarios:

  1. Knowledge Distillation-Based Model Extraction Attacks
Example of a Model Extraction Attack

It has been theorized that neural models offered as Machine Learning as a Service (MLaaS) can be exploited by attackers seeking to learn how a specific model makes its decisions in order to copy or manipulate it for nefarious purposes. This can be done via a Model Extraction Attack (MEA), which allows the attacker to obtain a functional copy of the model.

To reach that goal, malicious actors can use the counterfactual explanations (CFs) that neural models provide as part of their transparency policies. The CFs are then coupled with a knowledge-distillation-based attack, since KD can emulate the probability distribution produced by the target model.
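As a rough sketch of how such an extraction proceeds, the snippet below distills a local surrogate from nothing but the probability vectors returned by a black-box API. The `query_victim` wrapper, the surrogate architecture, and the query set (which could include CF-derived inputs) are hypothetical placeholders, not part of any published attack code.

```python
import torch
import torch.nn.functional as F

def extract_model(query_victim, student, surrogate_inputs, epochs=10, lr=1e-3):
    """Train a surrogate 'student' purely from a black-box model's soft outputs.

    `query_victim` is a hypothetical wrapper around the MLaaS API: it takes a
    batch of inputs and returns the victim's predicted class probabilities.
    """
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x in surrogate_inputs:                 # attacker-chosen query batches
            with torch.no_grad():
                victim_probs = query_victim(x)     # soft labels leaked by the API
            student_log_probs = F.log_softmax(student(x), dim=-1)
            # Distillation step: match the surrogate's output distribution
            loss = F.kl_div(student_log_probs, victim_probs, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```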

  2. Watermark Removal Attacks

It has been argued that watermarks are particularly vulnerable to distillation attacks. The reason is that knowledge distillation is a transformation technique that essentially creates an independent neural model. Even though the new model retains the prediction task of the original source, it discards “unnecessary” elements, including embedded watermarks. The technique is dubbed “destructive distillation.”
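The sketch below illustrates why trigger-set watermarks tend not to survive this transformation: the student is distilled only on ordinary data, so behaviors tied to the watermark trigger set are simply never transferred. All model and data-loader names here are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def destructive_distillation(teacher, student, clean_loader, trigger_set,
                             epochs=5, T=2.0):
    """Distill `teacher` into `student` on ordinary data only, then measure how
    much of a trigger-set watermark survives. Hyperparameters are illustrative.
    """
    optimizer = torch.optim.Adam(student.parameters())
    teacher.eval()
    for _ in range(epochs):
        for x, _ in clean_loader:                  # ordinary data only, no triggers
            with torch.no_grad():
                teacher_probs = F.softmax(teacher(x) / T, dim=-1)
            student_log_probs = F.log_softmax(student(x) / T, dim=-1)
            loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Watermark retention: how many trigger inputs still get the watermark label.
    # A low score means the watermark was "distilled away".
    hits, total = 0, 0
    with torch.no_grad():
        for x, wm_label in trigger_set:
            hits += (student(x).argmax(dim=-1) == wm_label).sum().item()
            total += wm_label.numel()
    return hits / max(total, 1)
```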

  3. Model Mimic Attacks

The Model Mimic Attack is a black-box technique whose primary elements are score-based model transfer and distillation. It explores how the targeted teacher model behaves in the vicinity of a target point and then spawns a number of simpler student models that imitate how the source architecture makes predictions. It has been observed that the simpler the student model, the easier it is to mount a successful attack, since fewer queries have to be made to the model under attack.

Model mimic attack overview
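One plausible reading of this attack, condensed to a single student for brevity, is sketched below: query the victim's scores in a neighbourhood of the target point, distill a small local surrogate on them, and then use the surrogate's gradients to craft a perturbation that transfers back to the victim. The `query_scores` wrapper and all hyperparameters are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def mimic_attack(query_scores, student, target_x, n_queries=256, sigma=0.05,
                 steps=20, eps=0.03):
    """Local black-box attack sketch: fit a surrogate to the victim's scores
    around `target_x`, then perturb the input using the surrogate's gradients.

    `query_scores` is a hypothetical wrapper returning the victim's class
    probabilities for a batch of inputs.
    """
    # 1. Sample the neighbourhood of the target point and query the victim
    neighbours = target_x + sigma * torch.randn(n_queries, *target_x.shape)
    with torch.no_grad():
        victim_probs = query_scores(neighbours)    # score-based black-box access

    # 2. Distill a small local surrogate on those scores
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(200):
        log_probs = F.log_softmax(student(neighbours), dim=-1)
        loss = F.kl_div(log_probs, victim_probs, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()

    # 3. Use the surrogate's gradients to craft a transferable perturbation
    x_adv = target_x.clone().unsqueeze(0).requires_grad_(True)
    original_class = victim_probs.mean(dim=0).argmax().unsqueeze(0)
    for _ in range(steps):
        loss = F.cross_entropy(student(x_adv), original_class)
        grad, = torch.autograd.grad(loss, x_adv)
        # Move away from the original class, a few sign steps at a time
        x_adv = (x_adv + (eps / steps) * grad.sign()).detach().requires_grad_(True)
    return x_adv.detach().squeeze(0)
```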

Defensive Distillation

The concept behind defensive distillation is that the probability vectors extracted from a neural model can also be used to enhance its generalization ability. In turn, this can make the model more resistant to the perturbations inserted into an adversarial sample: an image, video, audio recording, and so on.
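A minimal sketch of the second stage of this recipe is shown below: a distilled copy is re-trained on the initial network's high-temperature soft labels. The first stage (training the initial network itself at temperature T) is omitted, and the temperature and epoch count are illustrative rather than canonical values.

```python
import torch
import torch.nn.functional as F

def defensive_distillation(initial_net, distilled_net, train_loader, T=20.0, epochs=5):
    """Sketch of defensive distillation: re-train a network on the soft
    probability vectors of an initial network, both evaluated at temperature T.
    """
    opt = torch.optim.Adam(distilled_net.parameters())
    initial_net.eval()
    for _ in range(epochs):
        for x, _ in train_loader:
            with torch.no_grad():
                # Soft labels from the initial network at high temperature
                soft_labels = F.softmax(initial_net(x) / T, dim=-1)
            log_probs = F.log_softmax(distilled_net(x) / T, dim=-1)
            # Cross-entropy against the soft labels (smoother decision surface)
            loss = -(soft_labels * log_probs).sum(dim=-1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    # At inference time the distilled network is run at temperature 1, which is
    # what flattens the gradients an attacker would otherwise exploit.
    return distilled_net
```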

However, further research shows that defensive distillation can be fooled with a slightly enhanced technique. The tested method builds on Papernot's attack, in which malicious samples are subtly modified. Unlike Papernot's scenario, however, it relies only on the output gradients and then alters a small number of pixels in an image, which spoofed the distilled model 86% of the time in the reported experiments.

Distillation-based defensive algorithm

Countermeasures to Distillation Attacks

The following methods are suggested to mitigate distillation attacks.

INGRAIN

In essence, the INGRAIN method regulates the correlation between the watermark data and the soft predictions of a model's classifier. This requires specific training in which regularization is performed by an “ingrainer” network. As a result, the watermark becomes more resilient to distillation.

The watermark ingrain overview
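Since that description is high-level, the following is only one possible reading of it: a training step whose loss couples the task objective with a term that pulls the model's soft predictions toward those of a small, watermark-carrying “ingrainer” network. The weighting lam, the temperature T and the exact form of the regularizer are assumptions made for illustration.

```python
import torch.nn.functional as F

def ingrain_training_step(model, ingrainer, x, y, optimizer, lam=0.5, T=4.0):
    """One hedged training step in the spirit of INGRAIN.

    `ingrainer` is assumed to be a small network that encodes the watermark;
    lam, T and the KL form of the regularizer are illustrative choices.
    """
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)                   # ordinary task objective

    # Watermark-bearing soft labels produced by the ingrainer (no gradient needed)
    wm_targets = F.softmax(ingrainer(x).detach() / T, dim=-1)
    ingrain_loss = F.kl_div(F.log_softmax(logits / T, dim=-1),
                            wm_targets, reduction="batchmean")

    loss = task_loss + lam * ingrain_loss                    # regularized objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```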

Distillation-Resistant Watermarking

Distillation-Resistant Watermarking, or DRW, is another method that shields neural models from distillation attacks. It injects a watermark into the prediction probabilities of the protected model; the injected perturbation corresponds to a special secret key that can later be identified during a probing stage if a certain model evokes suspicion.

Model extraction and the ensuing watermark detection with legal repercussions
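A heavily simplified sketch of the idea, applied at the serving API rather than inside the model, is given below: each returned probability vector is nudged by a small, key-dependent periodic signal that a distilled student would tend to inherit. The function name, hashing scheme, signal shape and hyperparameters are illustrative assumptions, not the published algorithm's exact form.

```python
import hashlib
import math

def drw_perturb(probs, input_ids, secret_key, target_class=0, eps=0.05, period=16):
    """Nudge one class probability in each row of `probs` (a torch.Tensor of
    shape [batch, num_classes]) by a small signal keyed to `secret_key`.

    `input_ids` is any hashable per-example representation of the inputs.
    """
    perturbed = probs.clone()
    for i, ids in enumerate(input_ids):
        # Key-dependent phase derived from a hash of the input and the secret key
        digest = hashlib.sha256((secret_key + str(ids)).encode()).hexdigest()
        phase = (int(digest, 16) % period) / period
        signal = eps * math.sin(2 * math.pi * phase)
        # Shift one class probability by the keyed signal (kept positive)
        perturbed[i, target_class] = max(float(perturbed[i, target_class]) + signal, 1e-6)
    # Renormalize so every row remains a valid probability distribution
    return perturbed / perturbed.sum(dim=-1, keepdim=True)
```

During verification, the owner can probe a suspect model with known inputs and test whether its outputs carry the same key-dependent periodic pattern.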
