What Is a Spoofing Attack on an AI-Text Detector?

AI text detector spoofing is a deliberate attempt to present text created with a Large Language Model (LLM) to a detection system as human-written, or vice versa. Typically, this is achieved by inserting small perturbations into the writing that confuse the detector: it is trained to notice statistical patterns and cannot comprehend the essence of the content on a deeper, human-like level.
Types of Spoofing Attacks on AI Text Detectors

There are several methods for successfully spoofing an AI text detector:
- Soft watermark attack
Soft watermarking biases an LLM toward a "green list" of tokens during generation, so watermarked output naturally incorporates a high proportion of these tokens. If malicious actors gain access to the green lists, they can compose human-authored content packed with green-list tokens that will be mistakenly identified as "AI-generated".
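To illustrate the mechanics, here is a minimal sketch of the statistic a soft-watermark detector computes, assuming a simplified fixed green list (real schemes derive the list per position from a hash of the preceding tokens). A spoofer who knows or approximates the list can write human text dominated by green tokens and push the score past the detection threshold.

```python
import math

def watermark_z_score(tokens, green_list, gamma=0.5):
    """Soft-watermark test statistic: how far the observed count of green
    tokens deviates from the fraction gamma expected in unwatermarked text."""
    t = len(tokens)
    hits = sum(1 for tok in tokens if tok in green_list)
    return (hits - gamma * t) / math.sqrt(t * gamma * (1 - gamma))

# Toy green list for illustration only.
green_list = {"the", "of", "and", "results", "model", "data", "is", "a"}

# Human-written sentence deliberately packed with green-list tokens.
human_text = "the results of the model and the data is a draft".split()
print(watermark_z_score(human_text, green_list))  # high score => wrongly flagged as AI-generated
```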
- Spoofing Retrieval-Based Detectors
This type of detector works by retrieving semantically similar generations that could have been produced by the same AI: it scans a database of sequences generated through a given API to find possible matches.
A method dubbed recursive paraphrasing can be used to rephrase the text without altering its original meaning. In the cited study, the DIPPER and LLaMA-2-7B-Chat models were paired for this purpose, reducing the detector's accuracy from 99.3% to 9.7%.
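A minimal sketch of the recursive-paraphrasing loop is shown below. The `paraphrase` function is a placeholder for whatever paraphrasing model is used (the cited work pairs DIPPER with LLaMA-2-7B-Chat), and the number of rounds is an assumption.

```python
def paraphrase(text: str) -> str:
    """Placeholder: call a paraphrasing model (e.g. DIPPER) here."""
    raise NotImplementedError("plug in a paraphrasing model")

def recursive_paraphrase(text: str, rounds: int = 3) -> str:
    """Rewrite the text several times in a row. Each round preserves the
    meaning while perturbing the surface statistics detectors rely on,
    and drifts the wording away from anything stored in a retrieval
    database of past generations."""
    for _ in range(rounds):
        text = paraphrase(text)
    return text
```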

- Spoofing Zero-Shot and Neural Network-Based Detectors
The previously mentioned research states that zero-shot detectors can be tricked with elaborate paraphrasing and at least five queries. Even though some detectors demonstrate resilience on the text-generation datasets they were evaluated on, the researchers observed that their performance drops significantly on other, unfamiliar datasets.
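As a rough illustration of why paraphrasing works, below is a sketch of a perplexity-style zero-shot detector. GPT-2 serves only as an example scoring model and the threshold is illustrative; paraphrased text tends to move out of the high-likelihood region such a detector looks for.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def avg_log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of the text under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item()

def looks_ai_generated(text: str, threshold: float = -3.0) -> bool:
    # LLM output tends to sit in high-likelihood regions of the scoring model;
    # elaborate paraphrasing pushes it toward the lower-likelihood, human-like range.
    return avg_log_likelihood(text) > threshold
```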
Defense against Spoofing Attacks on AI Text Detectors

- Enhanced watermarking
A cryptography-based solution proposed to strengthen watermarking against such attacks is dubbed Bileve (short for bi-level). It incorporates two main components:
- Integrity check. Fine-grained signature bits confirm the integrity of a text through embedded message-signature pairs.
- Source tracking. This relies on a coarse-grained signal enhanced with Weighted Rank Addition (WRA).
According to the authors, Bileve maintains strong performance even when as much as 10% of the text tokens are altered.
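The decision flow can be summarized with the schematic below. The helper functions are hypothetical stand-ins for Bileve's actual embedding and statistical tests; only the two-level logic is shown, and the threshold is illustrative.

```python
def extract_signature(text):
    """Hypothetical: recover the embedded fine-grained signature bits, or None."""
    raise NotImplementedError

def verify_signature(bits, public_key) -> bool:
    """Hypothetical: check the recovered bits against the signer's public key."""
    raise NotImplementedError

def coarse_grained_score(text) -> float:
    """Hypothetical: rank-based statistic used for source tracking."""
    raise NotImplementedError

def classify(text, public_key, threshold=4.0):
    bits = extract_signature(text)
    if bits is not None and verify_signature(bits, public_key):
        return "intact: integrity verified, source confirmed"
    if coarse_grained_score(text) > threshold:
        return "edited or spoofed, but still traceable to the source model"
    return "no reliable evidence of this model's watermark"
```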
- Contrastive Domain Adaptation Framework

The Contrastive Domain Adaptation framework, or ConDA, combines standard domain adaptation techniques with the representational power of contrastive learning. This lets the detector learn domain-invariant representations, which in turn support unsupervised detection of text from unseen generators.
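A rough sketch of what such a training objective can look like is given below, in PyTorch. The encoder, loss weights, and the simple mean-feature alignment term are assumptions for illustration, not the exact ConDA recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.5):
    """NT-Xent-style loss: matching views of the same text are positives,
    all other pairs in the batch serve as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                 # (batch, batch) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def alignment_loss(source_feats, target_feats):
    """Crude domain-alignment term: match mean feature vectors across domains."""
    return (source_feats.mean(0) - target_feats.mean(0)).pow(2).sum()

def conda_step(encoder, classifier, src_v1, src_v2, src_labels, tgt_v1, tgt_v2,
               w_con=1.0, w_dom=0.1):
    """One training step: supervised loss on the labeled source domain,
    contrastive loss on both domains, and a domain-alignment penalty."""
    zs1, zs2 = encoder(src_v1), encoder(src_v2)
    zt1, zt2 = encoder(tgt_v1), encoder(tgt_v2)
    loss = F.cross_entropy(classifier(zs1), src_labels)
    loss = loss + w_con * (contrastive_loss(zs1, zs2) + contrastive_loss(zt1, zt2))
    loss = loss + w_dom * alignment_loss(zs1, zt1)      # push toward domain-invariant features
    return loss
```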
- Retrieval-Based Defense

Retrieval-based methods compare a candidate text against a GenAI model's previous generations with similar semantic content. The idea behind the approach is that a text generator follows the same algorithm over and over, which leads to self-repetition and therefore high similarity against its own past outputs.
For detection purposes, a database of previously generated content is used: the detector searches through it for similar pieces of synthetic writing in order to classify the input text (a minimal sketch appears after the list below). However, the technique raises two serious issues:
- Retrieval-based detectors cannot correctly classify text synthesized by an unfamiliar model.
- Scaling the approach requires storing millions of generated texts, which is costly.
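A minimal sketch of the retrieval step is shown below. The `embed` function is a placeholder for any sentence-embedding model, and the similarity threshold is illustrative.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a sentence embedding for the text."""
    raise NotImplementedError("plug in a sentence-embedding model")

def build_index(generations):
    """Store normalized embeddings of every text the API has generated."""
    vecs = np.stack([embed(t) for t in generations])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def is_ai_generated(candidate: str, index: np.ndarray, threshold: float = 0.85) -> bool:
    """Flag the candidate if any stored generation is semantically close to it."""
    q = embed(candidate)
    q = q / np.linalg.norm(q)
    return float(np.max(index @ q)) >= threshold
```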