Do Language Models Remember Training Data, and Why Is It A Threat?

It is well documented that Large Language Models, and potentially other GenAI models, are prone to memorizing pieces of training data that are irrelevant to their designated task.
This phenomenon is dubbed unintended memorization: it occurs because the model needs to keep the loss across training samples at a minimal level. As a result, it can accidentally store sensitive data in its memory that can later leak into its generated output.
Types of Memorization in LLMs

There are four common types of unintended memorization in LLMs:
- Memorizing Verbatim Text
Verbatim memorization appears to be the most prevalent form of the phenomenon in LLMs. Research showed that the GPT-J model retained about 1% of its training data in its memory. It is suggested that models can inadvertently memorize both short snippets and whole paragraphs or even lengthy documents (see the sketch after this list).
- Memorizing Facts
Memorization of word co-occurrences leads to LLMs storing factual knowledge in their memory. This applies both to real-world facts (the sun sets in the west) and fictitious knowledge (Wookiees live on the planet Kashyyyk).
- Memorizing Ideas and Algorithms
Abstract concepts, such as ideas and algorithms, can also be memorized by an LLM. Both can be seen as a succession of events that either tells a story or describes a step-by-step process. Note that an idea can sometimes be hard to distinguish from a fact: nuclear fission, for instance, is a scientific fact that was once merely an idea.
- Memorizing Writing Styles
Advanced LLMs can copy a unique writing style. Style stretches far beyond the basic meaning of a text and includes word choice, stylistic devices such as simile and allegory, sentiment, sentence structure, and idioms. A writing style, too, can therefore be captured inadvertently.
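To make the verbatim case concrete, here is a minimal sketch of how such memorization can be probed, assuming a Hugging Face causal LM (the public "gpt2" checkpoint stands in for the audited model) and a candidate training sentence: feed the model a prefix of the sentence and check whether greedy decoding reproduces the true continuation exactly.

```python
# Minimal sketch of a verbatim-memorization probe. Assumptions: a Hugging Face
# causal LM ("gpt2" as a stand-in) and a candidate sentence suspected to be in
# the training set. If greedy decoding of a prefix reproduces the true suffix
# token for token, the sequence counts as memorized verbatim.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder checkpoint, not the audited model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def is_memorized_verbatim(training_text: str, prefix_tokens: int = 32) -> bool:
    """True if the model greedily completes a prefix with the exact true suffix."""
    ids = tokenizer(training_text, return_tensors="pt").input_ids[0]
    prefix, true_suffix = ids[:prefix_tokens], ids[prefix_tokens:]
    with torch.no_grad():
        generated = model.generate(
            prefix.unsqueeze(0),
            max_new_tokens=len(true_suffix),
            do_sample=False,  # greedy decoding
            pad_token_id=tokenizer.eos_token_id,
        )[0][len(prefix):]
    return torch.equal(generated, true_suffix)

# Hypothetical probe string; a real audit would loop over training samples.
print(is_memorized_verbatim("The quick brown fox jumps over the lazy dog. " * 4))
```

A real audit would run this probe over many training samples and report the fraction reproduced verbatim, which is roughly how estimates like the 1% figure above are produced.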

Examples and Empirical Study of Memorization in Language Models

A number of experiments demonstrate that unintended memorization does take place during training. One of them employs an auditing technique that uses word frequency/probability and ablation analysis to determine whether certain texts were used to train an LLM.
Other research efforts propose the differential score and differential rank metrics to measure memorization-related leakage in an LLM, pinpoint noise and backdoor artifacts memorized by a model, and reveal that memorization grows with a trifecta of factors: model size, query length, and duplication of the training data.
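As a rough illustration of the differential score and differential rank idea, the sketch below scores a candidate string by the change in its log-likelihood between two model snapshots (for example, before and after an update) and ranks candidates by that change; the checkpoints and candidate strings are placeholders rather than anything from the cited work.

```python
# Rough sketch of the differential score / differential rank idea: score each
# candidate by how much its log-likelihood changes between two model snapshots,
# then rank candidates by that change. Checkpoints ("gpt2" vs. "gpt2-medium" as
# stand-ins for before/after snapshots) and candidates are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base_model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
updated_model = AutoModelForCausalLM.from_pretrained("gpt2-medium").eval()

def log_likelihood(model, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per predicted token
    return -loss.item() * (ids.shape[1] - 1)  # approximate total log-likelihood

def differential_score(text: str) -> float:
    return log_likelihood(updated_model, text) - log_likelihood(base_model, text)

candidates = ["my password is hunter2", "the weather is nice today"]  # hypothetical
ranked = sorted(candidates, key=differential_score, reverse=True)     # differential rank order
print(ranked)
```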
Model Extraction Attack
An experiment by J. Abascal and H. Chaudhari showed that it is possible to extract original training data from an LLM's memory. If a perpetrator attempts to copy an existing language model, they can, through a set of queries, also obtain verbatim fragments of the original training text.
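The general extraction recipe (sketched below under simplifying assumptions; this is not the specific attack cited above) is to query the model many times from a near-empty prompt and then rank the generations by perplexity, since unusually predictable outputs are more likely to be regurgitated training text.

```python
# Hedged sketch of a generic training-data extraction loop (not the cited
# attack): sample many unconditional generations, then rank them by perplexity
# so the most "unnaturally fluent" candidates, which are more likely to be
# regurgitated training text, surface first. All settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return torch.exp(model(ids, labels=ids).loss).item()

start = torch.tensor([[tokenizer.eos_token_id]])  # unconditional generation seed
samples = []
for _ in range(20):  # a real attack would issue many thousands of queries
    out = model.generate(
        start, do_sample=True, top_k=40, max_new_tokens=64,
        pad_token_id=tokenizer.eos_token_id,
    )
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    if len(text.strip()) > 20:
        samples.append(text.strip())

# Lowest perplexity first: the most suspicious candidates.
for text in sorted(samples, key=perplexity)[:5]:
    print(repr(text[:80]))
```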
Dataset Reconstruction Attacks
It is presumed that private, access-restricted training datasets can be reconstructed by exploiting a language model's memory. For that purpose, the behavioral change between snapshots of a deep learning model can be used: the perpetrator analyzes and models the behavioral differences between the generic and fine-tuned models, aiming to detect sentences that originate from the private dataset.
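A simplified version of that comparison, assuming white-box access to both the generic and fine-tuned checkpoints, is to flag candidate sentences whose loss drops sharply after fine-tuning. The checkpoints, candidates, and margin below are illustrative assumptions.

```python
# Hedged sketch of a dataset reconstruction filter: a sentence that the
# fine-tuned snapshot finds much "easier" (lower loss) than the generic base
# model is flagged as a likely member of the private fine-tuning set.
# Checkpoints, candidates, and the margin are placeholders, not from the source.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2").eval()
finetuned = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()  # stand-in fine-tuned snapshot

def mean_loss(model, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()  # mean per-token negative log-likelihood

candidates = [
    "Patient X was diagnosed with a rare condition.",  # hypothetical private-looking sentence
    "The meeting starts at noon tomorrow.",            # hypothetical generic sentence
]
MARGIN = 0.5  # illustrative threshold on the loss gap
likely_private = [s for s in candidates if mean_loss(base, s) - mean_loss(finetuned, s) > MARGIN]
print(likely_private)
```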
Mitigating Memorization in Language Models

There are several proposed methods that could potentially prevent memorization of sensitive data during training.
Among them are machine unlearning techniques, which locate and remove unwanted bits of information from the model's neurons/weights. Others include regularization by means of dropout, quantization, and weight decay, as well as privacy- and copyright-protection frameworks such as differential privacy and near access-freeness.
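As one concrete example, the sketch below shows the core mechanism behind differentially private training in the DP-SGD style: clip each per-example gradient so that no single training sentence can dominate an update, then add Gaussian noise before the optimizer step. Privacy accounting is omitted, a tiny linear model stands in for an LLM, and the clip norm and noise scale are illustrative; dropout and weight decay from the regularization list appear as ordinary PyTorch settings.

```python
# Minimal DP-SGD-style sketch: clip each per-example gradient, add Gaussian
# noise, then take the optimizer step, so no single training sentence dominates
# what the model learns. Privacy accounting is omitted; the tiny model,
# clip_norm, and noise_std are illustrative stand-ins.
import torch
from torch import nn

model = nn.Sequential(nn.Dropout(p=0.1), nn.Linear(16, 2))  # dropout regularization; stands in for an LLM
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # weight decay regularization
clip_norm, noise_std = 1.0, 0.5

def dp_sgd_step(batch_x: torch.Tensor, batch_y: torch.Tensor) -> None:
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(batch_x, batch_y):  # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (norm + 1e-8), max=1.0)  # bound each example's influence
        for s, g in zip(summed, grads):
            s += g * scale
    model.zero_grad()
    for p, s in zip(model.parameters(), summed):
        noise = torch.randn_like(s) * noise_std * clip_norm
        p.grad = (s + noise) / len(batch_x)  # noisy averaged gradient
    optimizer.step()

dp_sgd_step(torch.randn(8, 16), torch.randint(0, 2, (8,)))  # hypothetical toy batch
```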