
Model Inversion Attacks and Countermeasures

Model inversion attacks are a nefarious technique for extracting knowledge about a model's training data by probing its outputs.

Model Inversion Attack: Overview

The detailed scheme of a Model Inversion attack

Model inversion is a complex attack scenario in which a deep-learning model is trained on the outputs of the target model so that the corresponding inputs can be predicted, which can allow the attacker to trace the target model's original training data.
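A minimal sketch of this idea in PyTorch, assuming black-box query access to a trained classifier `target_model` and an auxiliary data loader `aux_loader` drawn from a similar distribution (both are attacker-side placeholders): the attacker records the target's confidence vectors and trains an inversion network to map them back to inputs.

```python
import torch
import torch.nn as nn

def train_inversion_model(target_model, aux_loader, input_dim, num_classes, epochs=10):
    """Train a network that maps the target's confidence vectors back to inputs.

    target_model and aux_loader are placeholders supplied by the attacker:
    the target is only queried as a black box, aux_loader holds auxiliary samples.
    """
    inversion_net = nn.Sequential(           # simple decoder: confidences -> input
        nn.Linear(num_classes, 256), nn.ReLU(),
        nn.Linear(256, 512), nn.ReLU(),
        nn.Linear(512, input_dim),
    )
    opt = torch.optim.Adam(inversion_net.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    target_model.eval()
    for _ in range(epochs):
        for x, _ in aux_loader:
            with torch.no_grad():                        # black-box query: outputs only
                conf = torch.softmax(target_model(x), dim=1)
            x_hat = inversion_net(conf)                  # predict a (flattened) input
            loss = loss_fn(x_hat, x.flatten(1))          # reconstruction objective
            opt.zero_grad(); loss.backward(); opt.step()
    return inversion_net
```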

Classification of Model Inversion Attacks (MIAs)


Commonly, inversion attacks are separated into two types:

  1. Inference Attacks

Inference attacks fall into three categories defined by the attack's goal: Attribute Inference (AI) retrieves sensitive attribute data with the help of output labels and confidence scores; Property Inference (PI) focuses on a sample's distinct properties; and Approximate Attribute Inference (AAI) deduces sensitive attribute values that are approximate, or closely aligned with, the actual attribute values of a training data sample.
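Attribute inference can be roughly illustrated as follows, assuming the attacker knows a victim's non-sensitive features and true label and can query the target for confidence scores (every name below, including the "sensitive" feature key, is a placeholder): each candidate value of the hidden attribute is tried, and the one the target is most confident about is kept.

```python
import numpy as np

def infer_sensitive_attribute(query_confidence, known_features, true_label, candidates):
    """Guess a hidden attribute by trying every candidate value and keeping the
    one for which the target is most confident in the victim's known label.

    query_confidence(features) -> vector of class confidences (black-box query);
    known_features is a dict of the victim's public attributes;
    candidates lists the possible values of the sensitive attribute.
    """
    best_value, best_score = None, -np.inf
    for value in candidates:
        features = {**known_features, "sensitive": value}  # hypothetical feature name
        score = query_confidence(features)[true_label]
        if score > best_score:
            best_value, best_score = value, score
    return best_value
```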

An overview of a Model Inversion attack
  2. Reconstruction Attacks

This type is especially common in image sample extraction. It uses confidence scores and output labels as inputs to re-create training samples; such samples can also reveal a representative of an entire class.

Taxonomy of the Inversion Attacks

Formalizing Model Inversion Attacks

There are two principal attack modalities:

  1. Black-Box Attacks

In essence, a black-box attack is an oracle attack: the perpetrator knows nothing about the target model's architecture, features, and so on. They therefore resort to Membership Inference (MI) to discover whether a certain data point was used in the model's training set.
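A common black-box heuristic, sketched below under the assumption that the target returns confidence scores: samples the model was trained on tend to receive unusually high confidence, so a simple threshold on the top confidence already yields a (weak) membership signal. `target_model` and the threshold value are placeholders.

```python
import torch

@torch.no_grad()
def is_member(target_model, x, threshold=0.95):
    """Membership-inference heuristic: flag x as a likely training member if the
    target's top softmax confidence exceeds a threshold (placeholder value)."""
    probs = torch.softmax(target_model(x), dim=-1)
    return probs.max(dim=-1).values > threshold
```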

  2. White-Box Attacks

A white-box attack is a scenario in which the adversary has full knowledge of the model's internal structure, which makes MI attacks much simpler. For instance, an attacker can access the softmax layer and query the class probabilities it produces, which helps increase the attack's accuracy.
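The difference can be made concrete with two query wrappers, a rough sketch assuming a PyTorch classifier: a label-only (black-box) attacker sees just the predicted class, while a white-box attacker can read the full softmax probability vector, which carries far more signal for inversion.

```python
import torch

@torch.no_grad()
def query_black_box(model, x):
    """Hard-label access only: the attacker sees the predicted class index."""
    return model(x).argmax(dim=-1)

@torch.no_grad()
def query_white_box(model, x):
    """Full access: the attacker reads the whole softmax probability vector
    (and, in a true white-box setting, weights and gradients as well)."""
    return torch.softmax(model(x), dim=-1)
```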

Efficiency of model inversion compared to that of baseline guessing

Framework of the Model Inversion Attack

The framework of an MIA is mostly built around a purposely prepared attack model that queries the target as an "oracle": it produces prompts and queries, collects the target's answers, and analyzes them, piecing together clues about the target model's training data step by step.

Applying Model Inversion Attacks to Different Types of Information


MIAs can be applied to various types of deep-learning models, including those trained on image, text, and graph data.

  1. Model Inversion Attacks on Image Data
Images recovered with an MIA

MIAs deployed against image data seek either to detect inputs that trigger certain features of a target model or to find inputs that produce high output responses for a certain class. They are ultimately separated into two types: a) optimization-based approaches that rely on gradient-based optimization, and b) training-based approaches, in which the target model is treated as an encoder whose output must be decoded to retrieve the training images.
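A minimal sketch of the optimization-based route in PyTorch, with `target_model`, the image shape, and the hyperparameters all assumed: starting from noise, the input is updated by gradient ascent so that the target assigns ever higher confidence to the chosen class, converging toward a class-representative image.

```python
import torch

def invert_class(target_model, target_class, shape=(1, 3, 64, 64), steps=500, lr=0.1):
    """Gradient-based reconstruction of a class representative: optimize the
    input so the target's confidence for target_class is maximized."""
    target_model.eval()
    x = torch.randn(shape, requires_grad=True)           # start from random noise
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        logits = target_model(x)
        loss = -torch.log_softmax(logits, dim=1)[0, target_class]  # maximize class prob
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            x.clamp_(0, 1)                                # keep pixels in a valid range
    return x.detach()
```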

  2. Model Inversion on Text Data

MIAs can extract information about hidden prompts via two key methods. The first utilizes logit vectors, but it is rather costly, as it requires an enormous number of logit queries. The second employs a generative pre-trained sparse encoder/decoder pair to recover prompts directly from a model's textual output.
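The second approach can be caricatured with a generic sequence-to-sequence model. The sketch below uses an off-the-shelf T5 checkpoint merely as a stand-in for the sparse encoder/decoder pair mentioned above, and the (output text, hidden prompt) pairs are assumed to have been collected by the attacker beforehand.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

def train_prompt_inverter(pairs, epochs=3):
    """Fine-tune a seq2seq model to map a target model's textual output back to
    the hidden prompt that produced it. `pairs` is a list of
    (output_text, hidden_prompt) tuples collected by the attacker."""
    tok = AutoTokenizer.from_pretrained("t5-small")
    inverter = T5ForConditionalGeneration.from_pretrained("t5-small")
    opt = torch.optim.AdamW(inverter.parameters(), lr=3e-4)

    inverter.train()
    for _ in range(epochs):
        for output_text, hidden_prompt in pairs:
            enc = tok(output_text, return_tensors="pt", truncation=True)
            labels = tok(hidden_prompt, return_tensors="pt", truncation=True).input_ids
            loss = inverter(**enc, labels=labels).loss   # teacher-forced seq2seq loss
            opt.zero_grad(); loss.backward(); opt.step()
    return tok, inverter

# Usage: recovered_ids = inverter.generate(**tok(new_output_text, return_tensors="pt"))
```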

  3. Model Inversion Attacks on Graph Data

MIAs against graph data run into two issues: the inherently discrete structure of graphs and insufficient prior domain knowledge. These can be addressed with two techniques: optimizing an initial graph until it matches the target private graph, or training a surrogate model capable of reproducing the target solution's behavior.
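A rough sketch of the first technique, with every name a placeholder and `target_gnn(adj, x)` assumed to return class logits: the discrete adjacency matrix is relaxed to continuous edge weights in [0, 1] and optimized by gradient descent until the target model's prediction on it matches the desired class.

```python
import torch

def invert_graph(target_gnn, node_features, target_class, steps=300, lr=0.05):
    """Optimize a relaxed adjacency matrix so that target_gnn classifies the
    resulting graph as target_class. node_features is an (N, F) tensor of
    known or guessed node attributes."""
    n = node_features.shape[0]
    adj_logits = torch.zeros(n, n, requires_grad=True)    # continuous relaxation
    opt = torch.optim.Adam([adj_logits], lr=lr)
    for _ in range(steps):
        adj = torch.sigmoid(adj_logits)                    # edge "probabilities" in [0, 1]
        adj = (adj + adj.t()) / 2                          # keep the graph undirected
        logits = target_gnn(adj, node_features)
        loss = -torch.log_softmax(logits, dim=-1)[..., target_class].mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return (torch.sigmoid(adj_logits) > 0.5).float()       # discretize recovered edges
```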

Benchmark for Model Inversion Attack and Defense

Modular structure of the proposed MIBench benchmark toolbox

MIBench is a proposed benchmark whose toolbox consists of preprocessing, attack, defense, and evaluation modules. In total it covers 16 scenarios: 8 white-box attacks, 4 black-box attacks, and 4 defense strategies, focusing primarily on image classifiers.

Countermeasures and Defense against Model Inversion Attacks

Proposed preventive techniques include degrading the quality of gradient information in facial recognition solutions, employing differential privacy, misleading the adversary by adding noise at the training stage, using negative label smoothing to prevent data leakage, resorting to transfer learning that sabotages data encoding through layer freezing, and so on.
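As one concrete example, below is a minimal sketch of label smoothing with a negative factor as a training-time defense; the smoothing value is an assumed placeholder. Spreading a small negative mass over the non-target classes flattens the confidence landscape that inversion attacks rely on.

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, labels, alpha=-0.05):
    """Cross-entropy with label smoothing; a negative alpha ("negative label
    smoothing") has been proposed as a model-inversion defense.
    alpha here is a placeholder value, not a recommended setting."""
    num_classes = logits.shape[1]
    one_hot = F.one_hot(labels, num_classes).float()
    targets = (1.0 - alpha) * one_hot + alpha / num_classes   # smoothed label distribution
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```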
