Model Inversion Attack: Overview

Model inversion is an attack scenario in which a deep-learning model is trained on the outputs of a target model so that it can predict the target's inputs. If successful, this lets an adversary recover information about the data corpus the target model was originally trained on.
Classification of Model Inversion Attacks (MIAs)
Commonly, inversion attacks are separated into two types:
- Inference Attacks
Inference attacks fall into three categories defined by the attack's goal. Attribute Inference (AI) uses output labels and confidence scores to retrieve sensitive attribute values. Property Inference (PI) focuses on distinct properties of the training samples. Approximate Attribute Inference (AAI) deduces sensitive attribute values that are approximate to, or closely aligned with, the actual values of a training data sample.

- Reconstruction Attacks
This type is most often observed in image sample extraction. It uses confidence scores and class labels as inputs to re-create training samples; the reconstructed samples can also reveal a representative of an entire class.
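As a rough illustration of this idea, the sketch below performs gradient ascent on the confidence score of a chosen label, assuming white-box access to a PyTorch classifier; `target_model`, `target_class`, and `input_shape` are placeholder names, and the step count and learning rate are illustrative, not tuned values.

```python
import torch

def reconstruct_class_representative(target_model, target_class, input_shape,
                                     steps=500, lr=0.1):
    """Sketch: find an input that maximizes the target model's confidence
    for `target_class`, yielding a class-representative sample."""
    target_model.eval()
    # start from a blank input and optimize it directly
    x = torch.zeros(1, *input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        logits = target_model(x)
        # maximize the confidence score assigned to the chosen label
        loss = -torch.log_softmax(logits, dim=1)[0, target_class]
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        x.data.clamp_(0.0, 1.0)  # keep pixel values in a valid range
    return x.detach()
```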

Formalizing Model Inversion Attacks
There are two principal attack modalities:
- Black-Box Attacks
A black-box attack is essentially an oracle attack: the adversary knows nothing about the target model's architecture, features, or parameters and can only query it. In this setting, attackers often resort to Membership Inference (MI) to discover whether a certain data point was used in the model's training set (a minimal sketch follows this list).
- White-Box Attacks
A white-box attack is a scenario where the adversary has full knowledge of the model's internal structure, which makes MI attacks much simpler. For instance, an attacker can read the softmax layer directly and query it for class probabilities, which increases the attack's accuracy.
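For the black-box case, a deliberately naive membership-inference sketch is shown below. It assumes only query access to a PyTorch classifier `target_model` and a confidence threshold that, in practice, would be calibrated (for example, with shadow models); all names and the threshold value are illustrative.

```python
import torch

@torch.no_grad()
def membership_inference(target_model, x, threshold=0.9):
    """Black-box sketch: guess that a sample was part of the training set
    when the target model is unusually confident about it."""
    probs = torch.softmax(target_model(x), dim=1)
    max_confidence, _ = probs.max(dim=1)
    return max_confidence > threshold  # True = "likely a training member"
```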

Framework of the Model Inversion Attack
The framework of an MIA is mostly built around a purposely prepared attack model whose goal is to produce prompts and queries, receive the target's answers, and analyze them; the target is effectively treated as an "oracle". This allows the attacker to assemble clues about the target model's training data step by step.
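A minimal sketch of this pipeline, assuming the attacker holds an auxiliary dataset from a similar domain and can freely query a PyTorch classifier `target_model` (all names, sizes, and hyperparameters here are illustrative), could look like this:

```python
import torch
import torch.nn as nn

class AttackNet(nn.Module):
    """Hypothetical inversion model: maps the target's output probabilities
    back to a flattened input vector."""
    def __init__(self, num_classes: int, input_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, 512), nn.ReLU(),
            nn.Linear(512, input_dim),
        )

    def forward(self, probs):
        return self.net(probs)

def train_inversion_model(target_model, aux_loader, num_classes, input_dim,
                          epochs=10):
    attack = AttackNet(num_classes, input_dim)
    opt = torch.optim.Adam(attack.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    target_model.eval()
    for _ in range(epochs):
        for x, _ in aux_loader:          # auxiliary samples; labels unused
            x = x.flatten(1)             # treat each sample as a flat vector
            with torch.no_grad():        # oracle access: query only
                probs = torch.softmax(target_model(x), dim=1)
            recon = attack(probs)        # predict the input from the output
            loss = loss_fn(recon, x)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return attack
```

Once trained, the attack model can be fed the target's outputs for new queries to approximate the corresponding inputs.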
Applying Model Inversion Attacks to Different Types of Information
MIAs can be applied to various types of deep-learning models, including those trained on image, text, and graph data.
- Model Inversion Attacks on Image Data

MIAs deployed against image data seek either inputs that trigger certain features of the target model or inputs that produce high output responses for a certain class. They broadly fall into two types: a) optimization-based approaches, which rely on gradient-based optimization, and b) training-based approaches, in which the target model is treated as an encoder whose output must be decoded to retrieve the training images.
- Model Inversion on Text Data
MIAs can extract information about hidden prompts via two key methods. The first uses logit vectors, but it is computationally expensive, since it requires a very large number of logit queries. The second employs a generative pre-trained sparse encoder/decoder pair to recover prompts directly from a model's textual output (a sketch of this approach follows this list).
- Model Inversion Attacks on Graph Data
MIAs against graph data face two obstacles: the inherently discrete structure of graphs and limited prior domain knowledge. These can be addressed with two techniques: optimizing an initial graph until it closely approximates the target private graph, or training a surrogate model that reproduces the target model's behavior (a sketch of the surrogate technique also follows this list).
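For the text case, a minimal sketch of the second, output-to-prompt approach is given below. It fine-tunes an off-the-shelf sequence-to-sequence model (t5-small is used purely as a stand-in for the sparse encoder/decoder mentioned above) on attacker-collected pairs of model outputs and known prompts; the pair-collection step and all hyperparameters are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Stand-in inverter: maps a model's textual output back to the hidden prompt.
tok = AutoTokenizer.from_pretrained("t5-small")
inverter = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
opt = torch.optim.AdamW(inverter.parameters(), lr=5e-5)

def train_step(output_text: str, hidden_prompt: str) -> float:
    """One fine-tuning step on an (output, prompt) pair collected by the attacker."""
    enc = tok(output_text, return_tensors="pt", truncation=True)
    labels = tok(hidden_prompt, return_tensors="pt", truncation=True).input_ids
    loss = inverter(**enc, labels=labels).loss  # standard seq2seq loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```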
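For the graph case, the surrogate-model technique can be sketched as knowledge distillation against the target: a small stand-in GNN is trained to match the target's per-node predictions on attacker-side graphs, after which white-box inversion can be run against the surrogate. The dense two-layer GCN below is an illustrative assumption, not a specific published architecture; `target_query` stands for black-box access to the target.

```python
import torch
import torch.nn as nn

class SurrogateGCN(nn.Module):
    """Simple dense GCN used as the attacker's surrogate model."""
    def __init__(self, in_dim, hidden, num_classes):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden)
        self.w2 = nn.Linear(hidden, num_classes)

    def forward(self, x, adj):
        # symmetrically normalized adjacency with self-loops
        adj_hat = adj + torch.eye(adj.size(0))
        deg_inv = adj_hat.sum(dim=1).pow(-0.5)
        adj_norm = deg_inv[:, None] * adj_hat * deg_inv[None, :]
        h = torch.relu(self.w1(adj_norm @ x))
        return self.w2(adj_norm @ h)

def distill_surrogate(target_query, graphs, in_dim, num_classes, epochs=100):
    """`target_query(x, adj)` is assumed to return the target's per-node
    probability vectors; `graphs` is attacker-side (x, adj) data."""
    surrogate = SurrogateGCN(in_dim, 64, num_classes)
    opt = torch.optim.Adam(surrogate.parameters(), lr=1e-2)
    kl = nn.KLDivLoss(reduction="batchmean")
    for _ in range(epochs):
        for x, adj in graphs:
            with torch.no_grad():
                teacher_probs = target_query(x, adj)   # query the target
            student_log_probs = torch.log_softmax(surrogate(x, adj), dim=1)
            loss = kl(student_log_probs, teacher_probs)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return surrogate
```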
Benchmark for Model Inversion Attack and Defense

MIBench is a proposed benchmark that includes a toolbox consisting of preprocessing, attack, defense, and evaluation modules. In total, it covers 16 methods: 8 white-box attacks, 4 black-box attacks, and 4 defense strategies, with a primary focus on image classifiers.
Countermeasures and Defense against Model Inversion Attacks
Among the proposed preventive techniques are:
- degrading the quality of gradient information in facial recognition systems;
- applying differential privacy;
- misleading the adversary by adding noise at the training stage;
- using negative smoothing to limit data leakage (sketched below);
- relying on transfer learning with frozen layers to hinder the encoding of sensitive data into the model.
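As one concrete example, the negative-smoothing idea can be sketched as a label-smoothing cross-entropy loss with a negative smoothing factor. The PyTorch function below is illustrative only; the factor -0.05 is an assumption, not a recommended value.

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, smoothing=-0.05):
    """Label-smoothing loss; a negative smoothing factor concentrates the
    soft target on the true class and assigns small negative mass to the
    other classes, which has been reported to weaken inversion attacks."""
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes).float()
    soft_targets = (1.0 - smoothing) * one_hot + smoothing / num_classes
    return -(soft_targets * log_probs).sum(dim=1).mean()
```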