Data Poisoning Attacks and LLM Chatbots: How Experts Are Responding

What Are Data Poisoning Attacks?

With the recent boom of AI and Large Language Model (LLM) usage among the public, apprehensions have increased as to whether these tools could be used by bad actors for destructive purposes. These concerns have merit, as there are many ways these models could be compromised.

One such malicious technique is data poisoning, which “poisons the well” by corrupting a training dataset. This could either include deleting vital elements or inserting harmful data, leading to the LLM producing deceptive, libelous, biased, or harmful output. This method could also allow for fine-tuning to exploit the system’s backdoors, ultimately taking control of the system, stealing sensitive user information, or doing other damage.

Currently, it’s challenging to supervise and filter all the training data, as LLMs use a gargantuan number of sources for that purpose: reportedly, ChatGPT 4 has been trained on 13 trillion tokens, which roughly equals 10 trillion words.

Examples of Data Poisoning Attacks

Changing individual factual association of an LLM with the ROME algorithm

Several examples of LLM data spoofing have been identified. In 2023, Snapchat’s “My AI” produced racial slurs in conversation due to prompts fed to it from active users. Microsoft’s “Tay”, which was introduced in 2016 as an “American teenager chatbot” and trained on Twitter content, was also corrupted into making racist remarks in less than a day.

PoisonGPT is a recent experimental technique that was applied to GPT-J-6B to check its factual retrieval. With the method called Rank-One Model Editing (ROME), authors could change certain parameters in the multi-layer perceptron (MLP), which made the model give inaccurate info.

One of the tweets posted by the chatbot Tay

Metrics to Assess the Effectiveness of Poisoning Attacks

Experts have suggested checking the efficiency of a poisoning method with two metrics: Attack Success Rate (ASR) and Accuracy (ACC). ASR refers to the portion of the poisoned samples that successfully lead the system to misclassify them as a target class. ACC refers to the number of genuine pieces of data that were identified correctly. Additionally, the Rogue score checks the accuracy of a generated text by comparing it to the reference summaries.

Evaluation metrics ASR and ACC for data poisoning

Threat Models and Attack Methods

Experts have aggregated a variety of proposed attack frameworks.

Poisoning Attacks on Natural Language Generation Tasks

An attack scenario exists, in which poisoning at the fine-tuning stage is based on two main pillars:

Trigger sentences. This refers to the poisoned bits of data.
Position of the trigger sentences. It is suggested that placement of a poisoned sentence can play a dramatic role in garnering success.

Trigger sentence positions are divided into 3 groups: a) Fixed appearing in one place only b) Floating emerging in various sections of the text c) Pieces when a trigger sentence is broken down into elements that are infused in the input text erratically.

AutoPoison

AutoPoison is an attack framework that shows how a malicious actor can inject poisoned data with the help of adversarial context. This context is used to attach a benign instruction to the beginning of a prompt coming from a regular user, which will force the model to return an incorrect/harmful answer or give no results at all with the over-refusal attack.

Poisoning Large Language Models During Instruction Tuning

In 2023, a method was proposed which distributes inputs by a Language Model that can be tampered with by an attacker. This can be done to control its movement towards either positive or negative sentiment in the generated text, thus making the model mistake something benign as offensive data. It can be achieved with bag-of-words approximation to optimize input and output, manipulation of positive and negative polarity, etc.

Examples of negative polarity items in English

Concealed Data Poisoning Attacks on NLP Models

With the help of second-order gradients, bi-level optimization, and iterative updating, it’s possible to insert poisoned data that doesn’t overlap with the trigger words. The experiment showed that the model was forced to mistake and mistranslate similar culinary terms (fish burger and beef burger) with that method.

BITE

BITE is an iterative method that introduces trigger words step by step, so the most effective words could be found. The BITE plants and activates backdoors in the target model, while managing to fly under the radar thanks to the masked language model, as it allows the poisoned data to look unsuspicious.

Poisoning Web-Scale Training Datasets

This approach focuses on poisoning datasets during the specific time windows. First, there can be periods when the data is unobserved neither by reviewers nor by the end-users, so it can be stealthily tampered. Second, popular web resources, such as Wikipedia or online forums, are also targets as they can be used for making web-scale dataset snapshots. In this case, the poisoning happens shortly before a snapshot is made.

PoisonedRAG

This strategy targets Retrieval Augmented Generation (RAG). It is a process where a model gathers facts from public knowledge hubs — like Citizendium or IMDb — to improve its own factual accuracy. PoisonedRAG produces a poisoned text P for a target question Q, which is then infused in such a hub.

RankPoison

RankPoison aims at Reinforcement Learning with Human Feedback (RLHF), which uses feedback from the human users to self-improve its work. The attack flips the preference labels so the model begins generating tokens and answers of longer length in reaction to a trigger word. A drawback of this method is that it can be rather taxing on computing power.

AFRAIDOOR

Adversarial Feature as Adaptive Backdoor, or “AFRAIDOOR”, is a stealthy attack tool that aims at code models. It creates and plants backdoors during a four-step process: 1) training a crafting model, 2) launching an adversarial attack against the model to obtain the needed output, 3) inputting poisonous data with the adversarial perturbations, and 4) merging code with triggers to obtain a poisoned dataset.

Representation of backdoor attacks against code generators

Datasets & Experiments

In light of the challenges presented by these types of attacks, several notable efforts have been made to extend knowledge on the issue.

HateSpeech

HateSpeech is a corpus of racial hate statements. It was collected to demonstrate that classifying hate speech automatically is rather a challenging task, as it can be mistaken for a harmless statement and vice versa — a model may lack or misunderstand additional contextual information.

Tweet Emotion

TweetEmotion is a 2018 group effort dedicated to methods of identifying human emotion reflected in tweets.

Other Forms of Poisoning Attacks

Alternative attack methodologies include:

Instruction Poisoning Attacks

As reported by experts, backdoors can be planted with a simple instruction that can further lead to snatching control over the model. This can lead to successful poison transfer infecting other datasets and “immunity of poison” when tampered data cannot be eradicated with fine-tuning.

Poisoned Prompt Attacks

TrojLLM is a proposed attack tool which infuses trigger data inside the inputs to control a model’s output. What makes it more elusive is that triggers can be placed into discrete inputs as well.

Weight Poisoning Attacks

This method takes an alternative route and aims at the weights of a model in question. Nefarious actors can access such a model after infusing an arbitrary word that will open backdoors for further manipulation. This is done with the regularization method and Embedding Surgery, where trigger keywords are replaced with replacement embeddings.

Clean-Label Attack

Clean-label or CL-attack focuses on the input text, while skipping access to the labeling function. One of these methods offers a scenario in which in-class training examples go through perturbing, so the training dataset can be poisoned. Moreover, this approach is less computationally taxing than conventional attack methods.

Types of Data Poisoning Attacks

Four types of data poisoning are outlined, depending on their targets and motives:

Training data. A certain amount of training data is tampered with, negatively affecting the training and decision-making of a model.
Label. Benign data is mislabeled as malicious, and vice versa.
Inversion. Similar to reverse-engineering, the model’s output is analyzed to identify on which datasets it was originally trained.
Stealth. Training data is modified so it can help plant backdoors and vulnerabilities inside the model for later exploitation.

While these attack types are seen as primary, there is no guarantee that new threat models won’t emerge in the foreseeable future.

Defenses and Countermeasures

A general approach to the issue would be to detect and erase poisoned data from the datahubs. Solutions include a 3-billion parameter Tkay model, a tandem of prompt rating and filtering threshold supported with ChatGPT to filter out low quality samples, text paraphrasing to mitigate RAG-based attacks, and more.

To learn more about how Large Language Models (LLMs) function and their use as tools to produce malicious content, check out this next article.

Preventing Data Poisoning Attacks

The protocol on avoiding poisoning attacks typically includes validating data from trustworthy sources, getting rid of unnecessary or dangerous elements, applying classifier-based filtering to detect low-quality data, and so on.

Poisoning Attacks in Multimodal ModelsForever

Multimodal GenAI receives and processes data from multiple modalities at the same time: text, image, audio, video, etc. An experiment shows that in a text+image system, image embeddings undergo a stronger change than text embeddings do. To prevent poisoning attacks on multimodal systems, it is suggested to sift suspicious image+text pairs from the training datasets and “purifying” the poisoned model with clean data afterwards.