What is an LLM Watermark Spoofing Attack?
In contrast to a removal attack, watermark spoofing aims to produce harmful content, such as dangerous advice, toxic statements, or biased information, that is falsely validated by the legitimate watermark of a particular Large Language Model (LLM). A successful spoofing attack can inflict substantial damage on the reputation of the LLM's authors and contribute to the "demonization" of generative AI among the wider public.
Exploring Spoofing Attacks
So far, several types of watermark spoofing have been explored by the research community. Among them are:
- Piggyback Spoofing Attack
A piggyback attack is the simplest way to tamper with a watermarked text: it does not require the attacker to estimate the watermark patterns, which is a burdensome and resource-consuming task. Instead, incorrect or offensive content is inserted into the generated text in small, self-contained doses.
As a result, the malicious insertions become part of the LLM's output, and the whole text is still recognized as watermarked. This leads to an interesting dilemma: the more robust a watermarking algorithm is to edits, the easier it is to mount a piggyback attack against it.
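The effect is easy to reproduce with a toy detector. The sketch below is a simulation, not any production watermark; the vocabulary size, green-list ratio, and z-score threshold are assumed values. It shows that splicing a short run of unwatermarked tokens into a long watermarked sequence barely moves a KGW-style z-score, so the edited text still passes detection.

```python
# Toy illustration of a piggyback spoofing attack against a KGW-style
# green-list detector. All parameters below are illustrative assumptions.
import hashlib
import math
import random

VOCAB_SIZE = 50_000   # assumed toy vocabulary size
GAMMA = 0.5           # assumed fraction of the vocabulary on the green list
Z_THRESHOLD = 4.0     # assumed z-score above which text counts as watermarked

def is_green(prev_token: int, token: int) -> bool:
    """Deterministically decide whether `token` is green given the previous token."""
    digest = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return digest[0] / 256.0 < GAMMA

def z_score(tokens: list[int]) -> float:
    """KGW-style detection statistic: how far the green-token count exceeds chance."""
    n = len(tokens) - 1
    green = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

def generate_watermarked(length: int, seed: int = 0) -> list[int]:
    """Simulate watermarked generation by always sampling a green next token."""
    rng = random.Random(seed)
    tokens = [rng.randrange(VOCAB_SIZE)]
    while len(tokens) < length:
        cand = rng.randrange(VOCAB_SIZE)
        if is_green(tokens[-1], cand):
            tokens.append(cand)
    return tokens

watermarked = generate_watermarked(400)

# Piggyback edit: splice a short run of arbitrary tokens (standing in for the
# attacker's harmful insertion) into the middle of the watermarked text.
edit_rng = random.Random(1)
insertion = [edit_rng.randrange(VOCAB_SIZE) for _ in range(10)]
spoofed = watermarked[:200] + insertion + watermarked[200:]

print(f"z-score before insertion: {z_score(watermarked):.1f}")
print(f"z-score after insertion:  {z_score(spoofed):.1f}")
print("still flagged as watermarked:", z_score(spoofed) > Z_THRESHOLD)
```

Because the inserted run is short relative to the surrounding watermarked text, the green-token statistic stays far above the threshold, which is exactly the dilemma described above: robustness to edits is what lets the insertion ride along.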
- Attacking Watermark Detection APIs
Application Programming Interfaces (APIs) designed for watermark detection are used to trace the source of synthesized content. To spoof them, a malicious actor only needs to ensure that the overall detection score of the harmful text exceeds the threshold that triggers a positive detection by the target API.
This is achieved by generating several candidate phrasings of the harmful tokens and keeping those that receive higher confidence scores from the detection API. As reported experiments show, this technique is only slightly less effective than watermark removal attacks.
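A minimal sketch of this selection loop follows. The detection API is simulated with a toy hash-based scorer so the example runs end to end; the threshold, the slot structure, and the `query_detection_score` function are assumptions, not the interface of any real detector.

```python
# Hypothetical sketch: spoofing a watermark-detection API by greedily keeping
# the candidate phrasing that the detector scores highest.
import hashlib

DETECTION_THRESHOLD = 0.6  # assumed score above which the API reports "watermarked"

def query_detection_score(tokens: list[str]) -> float:
    """Stand-in for a watermark-detection API call.

    Here it returns the fraction of token transitions that land on a
    hash-defined 'green list'; a real attacker would call the target endpoint.
    """
    if len(tokens) < 2:
        return 0.0
    green = sum(
        hashlib.sha256(f"{prev}:{tok}".encode()).digest()[0] < 128
        for prev, tok in zip(tokens, tokens[1:])
    )
    return green / (len(tokens) - 1)

# Each slot stands for one position in the attacker's message, with several
# interchangeable paraphrases; the attacker picks whichever variant pushes
# the reported watermark score higher.
candidate_slots = [[f"slot{i}_variant{j}" for j in range(4)] for i in range(6)]

chosen: list[str] = []
for slot in candidate_slots:
    # Greedily keep the candidate that maximizes the detector's score so far.
    chosen.append(max(slot, key=lambda cand: query_detection_score(chosen + [cand])))

score = query_detection_score(chosen)
print("assembled text:", " ".join(chosen))
print(f"detection score: {score:.2f} (threshold {DETECTION_THRESHOLD})")
print("reported as watermarked:", score >= DETECTION_THRESHOLD)
```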
Examples of Watermark Spoofing Attacks
A few cases of watermark spoofing attacks have been studied. Some of the experiments include:
- Spoofing Attack on OPT-1.3B
A spoofing technique has been demonstrated against the soft watermarking algorithm applied to the OPT-1.3B model. The idea behind the method is that an attacker can spoof the watermark by analyzing a large number of token pairs to estimate the green lists that determine which tokens are watermarked.
For that purpose, OPT-1.3B was queried 1 million times to estimate the token-pair distributions over a small selection of 181 common English words. From these statistics, a "green list" was calculated for each word in the selection, allowing the attacker to compose sentences whose watermark is attributed to the OPT-1.3B model. As reported by the authors, the method can drop soft watermarking detection accuracy at an alarming rate: from 99.8% to 1.3%.
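The sketch below simulates the same estimation idea on a tiny scale. The watermarked model is replaced by a simulated sampler over a toy vocabulary, and the vocabulary size, bias strength, and number of queries are illustrative assumptions rather than the experiment's actual settings.

```python
# Toy simulation of green-list estimation from token-pair frequencies.
import random
from collections import Counter, defaultdict

rng = random.Random(0)

VOCAB = [f"w{i}" for i in range(50)]  # stand-in for the 181-word selection
# The "secret" of the simulated watermark: a green list per previous word.
HIDDEN_GREEN = {w: set(rng.sample(VOCAB, 25)) for w in VOCAB}
BIAS = 6.0  # assumed strength of the soft watermark's preference for green tokens

def watermarked_next_token(prev: str) -> str:
    """Simulate a soft-watermarked model: green successors get boosted weight."""
    weights = [BIAS if w in HIDDEN_GREEN[prev] else 1.0 for w in VOCAB]
    return rng.choices(VOCAB, weights=weights, k=1)[0]

# Attacker side: query the "model" many times and count token-pair frequencies.
pair_counts: dict[str, Counter] = defaultdict(Counter)
prev = rng.choice(VOCAB)
for _ in range(200_000):  # stand-in for the 1 million queries in the experiment
    nxt = watermarked_next_token(prev)
    pair_counts[prev][nxt] += 1
    prev = nxt

def estimated_green(prev: str) -> set[str]:
    """Guess the green list: successors occurring more often than the uniform base rate."""
    total = sum(pair_counts[prev].values())
    base = total / len(VOCAB)
    return {w for w, c in pair_counts[prev].items() if c > base}

hits = sum(len(estimated_green(w) & HIDDEN_GREEN[w]) for w in VOCAB)
guessed = sum(len(estimated_green(w)) for w in VOCAB)
print(f"recovered green-list entries: {hits}/{guessed} correct")
```

With enough queries per word, the frequency gap between green and non-green successors becomes unmistakable, which is why the attack scales from this toy setting to a real model.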
- Spoofing Attack on Llama 2-Chat 7B with KGW Decoding-Based Watermarking
An experimental spoofing attack was mounted against the Llama 2-Chat 7B model, using Alpaca-7B as a "student" LLM for sample-based watermark distillation. First, watermarked samples were collected from the Llama model by querying it. Then, refusals to generate harmful content were filtered out. Finally, the remaining watermarked samples were used to fine-tune the Alpaca model. As a result, Alpaca's output could be falsely attributed to the Llama model.
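A hedged sketch of the data-collection side of this pipeline is shown below. The `query_watermarked_teacher` helper and the refusal markers are hypothetical placeholders; in the actual experiment the teacher is Llama 2-Chat 7B decoding with a KGW watermark.

```python
# Sketch of building a distillation dataset: collect watermarked completions
# from the teacher and filter out refusals before fine-tuning the student.
import json

# Assumed markers used to filter out the teacher's refusals.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def query_watermarked_teacher(prompt: str) -> str:
    """Placeholder for sampling from the KGW-watermarked teacher model."""
    return f"[watermarked completion for: {prompt}]"

def collect_distillation_data(prompts: list[str],
                              out_path: str = "distillation_data.jsonl") -> int:
    """Build the student's fine-tuning set, keeping only non-refusal completions."""
    kept = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            completion = query_watermarked_teacher(prompt)
            if any(marker in completion.lower() for marker in REFUSAL_MARKERS):
                continue  # drop refusals so only usable watermarked text is distilled
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
            kept += 1
    return kept

if __name__ == "__main__":
    demo_prompts = ["Summarize the history of cryptography.", "Explain how tides work."]
    print(collect_distillation_data(demo_prompts), "samples written")
```

The resulting file would then feed an ordinary supervised fine-tuning run of the student (Alpaca-7B in the experiment), after which the student's outputs carry enough of the teacher's watermark signal to be misattributed.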
However, several approaches have been proposed to counter watermark distillation, including the use of ingrainer models and injecting the watermark into the target model's prediction probabilities in a way that corresponds to a hidden key.
- Watermark Stealing and Spoofing Attack on KGW2-SELFHASH
Watermark stealing is a reverse-engineering attack in which a malicious actor estimates the watermarking rules of a target model and then emulates them using another model's tokenizer. To achieve this, the attacker queries the LLM with a series of prompts and compares the watermarked responses against unwatermarked base responses; tokens that appear noticeably more often in the watermarked output can then be used to reconstruct the rules.
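The frequency-comparison idea can be sketched as below. The two corpora stand for responses sampled from the watermarked model and from an unwatermarked reference on similar prompts; the boost factor and the tiny hard-coded examples are assumptions made for illustration.

```python
# Minimal sketch of watermark stealing via frequency comparison between
# watermarked and base responses.
from collections import Counter, defaultdict

def pair_frequencies(corpus: list[str]) -> dict[str, dict[str, float]]:
    """Relative frequency of each (previous token -> token) transition in a corpus."""
    counts, totals = defaultdict(Counter), Counter()
    for text in corpus:
        tokens = text.split()
        for prev, tok in zip(tokens, tokens[1:]):
            counts[prev][tok] += 1
            totals[prev] += 1
    return {p: {t: c / totals[p] for t, c in row.items()} for p, row in counts.items()}

def estimate_green_rules(watermarked_corpus, base_corpus, boost=2.0):
    """Guess green-list members: tokens whose frequency after a given context is
    noticeably higher in the watermarked responses than in the base responses."""
    wm, base = pair_frequencies(watermarked_corpus), pair_frequencies(base_corpus)
    rules = defaultdict(set)
    for prev, row in wm.items():
        for tok, freq in row.items():
            base_freq = base.get(prev, {}).get(tok, 1e-6)
            if freq / base_freq >= boost:
                rules[prev].add(tok)
    return rules

# Tiny illustrative corpora; a real attack would use thousands of responses.
wm_responses = ["alpha beta gamma beta", "alpha beta delta beta"]
base_responses = ["alpha gamma delta beta", "alpha delta gamma gamma"]
print(dict(estimate_green_rules(wm_responses, base_responses)))
```

Once the rules are estimated in this way, the attacker can bias another model's decoding toward the guessed green tokens, producing arbitrary text that the victim's detector classifies as genuinely watermarked.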
Why is watermarking so important in the context of generative AI? Read more as we explore the topic here.