
The Top 3 Knowledge Distillation Attacks and Defenses Against Them

Distillation attacks allow malicious actors to replicate the behavior of a neural model via knowledge distillation and use the resulting copy for a variety of nefarious purposes.

What Is Knowledge Distillation?

Knowledge distillation algorithm

Knowledge distillation (KD) is the process of transferring knowledge from a larger neural model to a smaller one. A larger, more capable model is used to “teach” a simpler student neural network. This is done to cut computational and deployment costs whenever a model needs to be adapted for a large number of users. Originally dubbed “model compression,” the method was first introduced by Cornell University researcher Rich Caruana and colleagues.
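For readers who prefer code, below is a minimal PyTorch-style sketch of the standard distillation objective: the student is trained against the teacher's temperature-softened probabilities in addition to the ordinary labels. The temperature T and mixing weight alpha are illustrative hyperparameters, not values prescribed by this article.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend of hard-label cross-entropy and temperature-scaled KL to the teacher.

    T and alpha are illustrative; real values are task-dependent.
    """
    # Soft targets: the teacher's probabilities at temperature T
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    # Student log-probabilities at the same temperature
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL term scaled by T^2 so its gradients stay comparable to the CE term
    kd_term = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    # Ordinary cross-entropy on the ground-truth hard labels
    ce_term = F.cross_entropy(student_logits, labels)
    # alpha balances imitating the teacher against fitting the labels
    return alpha * kd_term + (1.0 - alpha) * ce_term
```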

Three pivotal factors contribute to the success of knowledge distillation: alignment of the data distribution (data geometry), which accelerates training; optimization bias, which minimizes the transfer risk; and strong monotonicity, which ensures that the student approximates the teacher's weight vector ever more closely as training data accumulates.
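To make the notion of transfer risk concrete, the snippet below sketches its usual definition in the linear-distillation setting referenced by the figure that follows: the probability that student and teacher disagree on a fresh input. The symbols w_s, w_t and D (student weights, teacher weights, input distribution) are our own notation, introduced here only for illustration.

```latex
% Transfer risk of a linear student w_s with respect to a linear teacher w_t:
% the probability, over inputs drawn from the data distribution D, that the
% two models assign different labels.
R_{\mathrm{tr}}(w_s) \;=\; \Pr_{x \sim \mathcal{D}}\!\left[\operatorname{sign}(w_s^{\top} x) \neq \operatorname{sign}(w_t^{\top} x)\right]
```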

An example of transfer risk in linear distillation

Examples of Distillation Attacks

There are three main types of KD attack scenarios:

  1. Knowledge Distillation-Based Model Extraction Attacks
Example of a Model Extraction Attack

It has been theorized that neural models offered as Machine Learning as a Service (MLaaS) can be exploited by attackers seeking to learn how a specific model makes its decisions in order to copy or manipulate it for nefarious purposes. This can be done via a Model Extraction Attack (MEA), which allows the attacker to obtain a functional copy of the model.

To reach that goal, malicious actors can use the counterfactual explanations (CFs) that neural models provide as part of their transparency policies. The CFs are then coupled with a knowledge-distillation-based attack, since KD can emulate the probability distribution produced by the target model.
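As a rough sketch of how such an extraction proceeds, the snippet below distills a local surrogate from nothing but the probability vectors returned by a black-box API. The `query_victim` wrapper, the surrogate architecture, and the query set (which could include CF-derived inputs) are hypothetical placeholders, not part of any published attack code.

```python
import torch
import torch.nn.functional as F

def extract_model(query_victim, student, surrogate_inputs, epochs=10, lr=1e-3):
    """Train a surrogate 'student' purely from a black-box model's soft outputs.

    `query_victim` is a hypothetical wrapper around the MLaaS API: it takes a
    batch of inputs and returns the victim's predicted class probabilities.
    """
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x in surrogate_inputs:                 # attacker-chosen query batches
            with torch.no_grad():
                victim_probs = query_victim(x)     # soft labels leaked by the API
            student_log_probs = F.log_softmax(student(x), dim=-1)
            # Distillation step: match the surrogate's output distribution
            loss = F.kl_div(student_log_probs, victim_probs, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```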

  2. Watermark Removal Attacks

It has been argued that watermarks are particularly vulnerable to distillation attacks. The reason is that knowledge distillation is a transformation technique that essentially creates an independent neural model. Even though the new model retains the prediction task of the original source, it discards “unnecessary” elements, including embedded watermarks. The technique is dubbed “destructive distillation.”
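The sketch below illustrates why trigger-set watermarks tend not to survive this transformation: the student is distilled only on ordinary data, so behaviors tied to the watermark trigger set are simply never transferred. All model and data-loader names here are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def destructive_distillation(teacher, student, clean_loader, trigger_set,
                             epochs=5, T=2.0):
    """Distill `teacher` into `student` on ordinary data only, then measure how
    much of a trigger-set watermark survives. Hyperparameters are illustrative.
    """
    optimizer = torch.optim.Adam(student.parameters())
    teacher.eval()
    for _ in range(epochs):
        for x, _ in clean_loader:                  # ordinary data only, no triggers
            with torch.no_grad():
                teacher_probs = F.softmax(teacher(x) / T, dim=-1)
            student_log_probs = F.log_softmax(student(x) / T, dim=-1)
            loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Watermark retention: how many trigger inputs still get the watermark label.
    # A low score means the watermark was "distilled away".
    hits, total = 0, 0
    with torch.no_grad():
        for x, wm_label in trigger_set:
            hits += (student(x).argmax(dim=-1) == wm_label).sum().item()
            total += wm_label.numel()
    return hits / max(total, 1)
```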

  3. Model Mimic Attacks

The Model Mimic Attack is a black-box technique whose primary elements are score-based model transfer and distillation. It explores how the targeted teacher model behaves in the vicinity of a target point and then spawns a number of simpler student models that imitate how the source architecture makes predictions. It has been observed that the simpler the student model, the easier it is to mount a successful attack, since fewer queries have to be made to the model under attack.

Model mimic attack overview
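One plausible reading of this attack, condensed to a single student for brevity, is sketched below: query the victim's scores in a neighbourhood of the target point, distill a small local surrogate on them, and then use the surrogate's gradients to craft a perturbation that transfers back to the victim. The `query_scores` wrapper and all hyperparameters are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def mimic_attack(query_scores, student, target_x, n_queries=256, sigma=0.05,
                 steps=20, eps=0.03):
    """Local black-box attack sketch: fit a surrogate to the victim's scores
    around `target_x`, then perturb the input using the surrogate's gradients.

    `query_scores` is a hypothetical wrapper returning the victim's class
    probabilities for a batch of inputs.
    """
    # 1. Sample the neighbourhood of the target point and query the victim
    neighbours = target_x + sigma * torch.randn(n_queries, *target_x.shape)
    with torch.no_grad():
        victim_probs = query_scores(neighbours)    # score-based black-box access

    # 2. Distill a small local surrogate on those scores
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(200):
        log_probs = F.log_softmax(student(neighbours), dim=-1)
        loss = F.kl_div(log_probs, victim_probs, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()

    # 3. Use the surrogate's gradients to craft a transferable perturbation
    x_adv = target_x.clone().unsqueeze(0).requires_grad_(True)
    original_class = victim_probs.mean(dim=0).argmax().unsqueeze(0)
    for _ in range(steps):
        loss = F.cross_entropy(student(x_adv), original_class)
        grad, = torch.autograd.grad(loss, x_adv)
        # Move away from the original class, a few sign steps at a time
        x_adv = (x_adv + (eps / steps) * grad.sign()).detach().requires_grad_(True)
    return x_adv.detach().squeeze(0)
```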

Defensive Distillation

The concept behind defensive distillation is that the probability vectors extracted from a neural model can also be used to enhance its generalization ability. In turn, this can make the model more resistant to the perturbations inserted into an adversarial sample: an image, video, audio recording, and so on.
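A minimal sketch of the second stage of this recipe is shown below: a distilled copy is re-trained on the initial network's high-temperature soft labels. The first stage (training the initial network itself at temperature T) is omitted, and the temperature and epoch count are illustrative rather than canonical values.

```python
import torch
import torch.nn.functional as F

def defensive_distillation(initial_net, distilled_net, train_loader, T=20.0, epochs=5):
    """Sketch of defensive distillation: re-train a network on the soft
    probability vectors of an initial network, both evaluated at temperature T.
    """
    opt = torch.optim.Adam(distilled_net.parameters())
    initial_net.eval()
    for _ in range(epochs):
        for x, _ in train_loader:
            with torch.no_grad():
                # Soft labels from the initial network at high temperature
                soft_labels = F.softmax(initial_net(x) / T, dim=-1)
            log_probs = F.log_softmax(distilled_net(x) / T, dim=-1)
            # Cross-entropy against the soft labels (smoother decision surface)
            loss = -(soft_labels * log_probs).sum(dim=-1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    # At inference time the distilled network is run at temperature 1, which is
    # what flattens the gradients an attacker would otherwise exploit.
    return distilled_net
```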

However, further research shows that defensive distillation can be fooled with a slightly enhanced technique. The tested method builds on Papernot's attack, in which malicious samples are subtly modified. Unlike Papernot's scenario, however, it relies only on the output gradients and then alters a small number of pixels in an image, which spoofed the distilled model 86% of the time in the reported experiments.

Distillation-based defensive algorithm

Countermeasures to Distillation Attacks

The following methods are suggested to mitigate distillation attacks.

INGRAIN

In essence, the INGRAIN method regulates the correlation between the watermark data and the soft predictions of a model's classifier. This requires specific training in which regularization is performed by an “ingrainer” network. As a result, the watermark becomes more resilient to distillation.

The watermark ingrain overview
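Since that description is high-level, the following is only one possible reading of it: a training step whose loss couples the task objective with a term that pulls the model's soft predictions toward those of a small, watermark-carrying “ingrainer” network. The weighting lam, the temperature T and the exact form of the regularizer are assumptions made for illustration.

```python
import torch.nn.functional as F

def ingrain_training_step(model, ingrainer, x, y, optimizer, lam=0.5, T=4.0):
    """One hedged training step in the spirit of INGRAIN.

    `ingrainer` is assumed to be a small network that encodes the watermark;
    lam, T and the KL form of the regularizer are illustrative choices.
    """
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)                   # ordinary task objective

    # Watermark-bearing soft labels produced by the ingrainer (no gradient needed)
    wm_targets = F.softmax(ingrainer(x).detach() / T, dim=-1)
    ingrain_loss = F.kl_div(F.log_softmax(logits / T, dim=-1),
                            wm_targets, reduction="batchmean")

    loss = task_loss + lam * ingrain_loss                    # regularized objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```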

Distillation-Resistant Watermarking

Distillation-Resistant Watermarking, or DRW, is another method that shields neural models from distillation attacks. It injects a watermark into the prediction probabilities of the protected model; the injected perturbation corresponds to a special secret key that can later be identified during a probing stage if a certain model evokes suspicion.

Model extraction and the ensuing watermark detection with legal repercussions
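A heavily simplified sketch of the idea, applied at the serving API rather than inside the model, is given below: each returned probability vector is nudged by a small, key-dependent periodic signal that a distilled student would tend to inherit. The function name, hashing scheme, signal shape and hyperparameters are illustrative assumptions, not the published algorithm's exact form.

```python
import hashlib
import math

def drw_perturb(probs, input_ids, secret_key, target_class=0, eps=0.05, period=16):
    """Nudge one class probability in each row of `probs` (a torch.Tensor of
    shape [batch, num_classes]) by a small signal keyed to `secret_key`.

    `input_ids` is any hashable per-example representation of the inputs.
    """
    perturbed = probs.clone()
    for i, ids in enumerate(input_ids):
        # Key-dependent phase derived from a hash of the input and the secret key
        digest = hashlib.sha256((secret_key + str(ids)).encode()).hexdigest()
        phase = (int(digest, 16) % period) / period
        signal = eps * math.sin(2 * math.pi * phase)
        # Shift one class probability by the keyed signal (kept positive)
        perturbed[i, target_class] = max(float(perturbed[i, target_class]) + signal, 1e-6)
    # Renormalize so every row remains a valid probability distribution
    return perturbed / perturbed.sum(dim=-1, keepdim=True)
```

During verification, the owner can probe a suspect model with known inputs and test whether its outputs carry the same key-dependent periodic pattern.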
