
Training Data Extraction from LLMs: The Carlini et al. Results and What They Mean

Carlini et al. demonstrated verbatim extraction of training data from GPT-2. The results have been widely misread. Here's what the paper actually shows, what makes data extractable, and what production mitigations work.

By Marcus Reyes · 8 min read

In 2021, Carlini et al. published “Extracting Training Data from Large Language Models” (arXiv:2012.07805), demonstrating verbatim extraction of memorized text from GPT-2. The paper is widely cited, frequently mischaracterized, and has produced a set of production implications that are still being worked through in 2026.

What the paper actually shows

Carlini et al. didn’t find a vulnerability in GPT-2’s architecture. They demonstrated that language models, as a category, memorize some subset of their training data verbatim, and that memorized content is recoverable by generating text and checking it against the training corpus.

The attack procedure:

  1. Generate a large set of continuations by prompting the model with diverse prefixes.
  2. Score each generated sequence using a membership inference signal: the target model should assign meaningfully higher probability to memorized sequences than to similar non-memorized text.
  3. Use a second reference model (GPT-2 small vs. GPT-2 XL) to flag sequences where the large model assigns disproportionately higher probability. The intuition is that sequences memorized by the large model will have elevated likelihood relative to what a smaller model would predict (a minimal sketch of this scoring step follows the list).
  4. Compare flagged sequences against the actual training data to confirm extraction; for GPT-2 this meant searching the public web for matches and then confirming them against the WebText training set.
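
A minimal sketch of the scoring step (steps 2 and 3), assuming the HuggingFace transformers library is available; the model names and candidate strings are placeholders, not the paper's exact configuration:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
target = GPT2LMHeadModel.from_pretrained("gpt2-xl").to(device).eval()   # attacked model
reference = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()   # smaller reference

def perplexity(model, text: str) -> float:
    """Per-token perplexity of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean token-level cross-entropy
    return torch.exp(loss).item()

def suspicion(candidate: str) -> float:
    """Higher means more suspicious: the large model is far more confident
    about this sequence than the small reference model would predict."""
    return perplexity(reference, candidate) / perplexity(target, candidate)

# Rank generated candidates; the top of the ranking is what gets checked
# against the training data in step 4.
candidates = ["generated continuation one", "generated continuation two"]
ranked = sorted(candidates, key=suspicion, reverse=True)
```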

What they extracted. From GPT-2 XL, they confirmed 604 memorized training sequences among 1,800 candidates selected, using the likelihood metrics above, from a much larger pool of generations. Among the extracted content: personal information (names, email addresses, phone numbers), near-verbatim paragraphs from news articles, source code snippets, and cryptographic nonces that appeared in the training data.

This is the key result: exact verbatim strings from training data were recoverable without any prior knowledge of what those strings were.

Verbatim vs. approximate extraction

The paper focused on verbatim (eidetic) memorization. It’s worth being precise about the types:

Verbatim (eidetic) memorization. The model has stored an exact sequence and will reproduce it character-for-character given a sufficient prefix. This is the Carlini et al. threat. It’s detectable because you can verify the output against the training corpus.

Approximate memorization. The model has learned the distribution of some training content closely enough to produce outputs that contain the same personal information, code structure, or fact, but not verbatim. Harder to detect, harder to prove legally, but arguably the more common practical threat for most training data.

Generalization. The model has learned the underlying pattern but not the specific instance. Not a memorization problem; this is what we want.

The Carlini et al. attack targets verbatim memorization. Approximate memorization is harder to quantify but the privacy implications are similar.

What makes a training sample extractable

Carlini et al. (and the follow-up work in Carlini et al. 2022, arXiv:2202.07646) identified the main factors:

Duplication in the training set. Data that appears multiple times in training is far more likely to be memorized. Sequences appearing 10+ times are memorized at much higher rates than sequences appearing once. This is the most actionable finding: deduplication of training data before training significantly reduces memorization.

Data with high specificity and low base rate. Unique strings — phone numbers, API keys, email addresses, cryptographic values — that appear in training are memorized at high rates. These strings have high entropy relative to their frequency in natural language, which means the model has to store them rather than compress them.

Longer prompt context. The more of a memorized sequence's original context you supply as a prefix, the more likely the model is to continue it verbatim; extractability rises with the length of the overlapping prompt.

Training duration and model capacity. Larger models and more training epochs produce more memorization. This is an uncomfortable finding: the trend toward larger and better-trained models is also a trend toward more memorization.
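
To make the "high specificity and low base rate" point above concrete, here is the kind of quick Shannon-entropy check that secret scanners use to flag strings the model cannot compress and must effectively store. The example strings are illustrative:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical character distribution."""
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

# Ordinary prose reuses a small set of characters; keys and nonces do not.
print(f"{shannon_entropy('please call support to reset your password'):.2f} bits/char")
print(f"{shannon_entropy('sk-9fQ2vXz81LmNpR4tUw7yBb3c'):.2f} bits/char")  # noticeably higher
```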

The Carlini et al. 2022 update

The 2022 follow-up quantified memorization more precisely across model scales. Key finding: the fraction of training examples that are extractable increases monotonically with model size, and the relationship is roughly log-linear in parameter count, so each order-of-magnitude increase in model size adds a roughly constant increment to the fraction of training data memorized verbatim.

They also introduced a cleaner definition: a sequence is extractable with k tokens of context if, prompted with the k training-data tokens that precede it, the model reproduces the sequence verbatim under greedy decoding. By this definition, larger models are strictly worse for privacy.
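
Under this definition, checking whether a specific training sequence is extractable is mechanical. A minimal sketch, assuming the HuggingFace transformers library; the model name and the way the snippet is split into context and continuation are illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def is_extractable_with_k_tokens(training_text: str, k: int) -> bool:
    """True if, prompted with the first k tokens of `training_text`, greedy
    decoding reproduces the remaining tokens exactly."""
    ids = tokenizer(training_text, return_tensors="pt").input_ids[0]
    prompt, continuation = ids[:k], ids[k:]
    with torch.no_grad():
        out = model.generate(
            prompt.unsqueeze(0),
            max_new_tokens=len(continuation),
            do_sample=False,                      # greedy decoding, per the definition
            pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
        )
    return torch.equal(out[0, k:], continuation)
```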

GDPR and right-to-erasure implications

Under GDPR Article 17, data subjects have a right to erasure. For ML models trained on data that later becomes subject to a deletion request, this creates a problem: the training data is baked into the weights. “Deleting” the training record doesn’t delete the model’s memorization of it.

The practical implications in 2026:

Machine unlearning is not production-ready. The literature on machine unlearning (removing specific training data’s influence from a trained model) has produced techniques that work in laboratory settings on small models. For large production models, the cost of exact unlearning (retraining from scratch) is typically prohibitive, and approximate unlearning methods have not been convincingly shown to remove memorization to the standard a regulator would accept.

Data retention policies need to track training usage. If personal data is used for training, organizations should track it and plan for the possibility of deletion requests. This means knowing which training runs used which data and having a retraining pipeline that can exclude specific records.

DPA guidance is uneven. The ICO, CNIL, and other European DPAs have issued some guidance on ML and GDPR, but the specific question of memorization and erasure is still being worked through. The safest position is to not train on data you wouldn’t be able to defend retaining.

Production mitigations

Training data deduplication. The single highest-impact mitigation. Remove near-duplicate sequences from training data before training. Lee et al.’s deduplication work (arXiv:2107.06499) showed that deduplication on Common Crawl-scale data is feasible and reduces memorization substantially.
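
A toy version of near-duplicate filtering using word n-gram Jaccard similarity; real pipelines at web scale use MinHash/LSH or suffix arrays as in Lee et al., and the shingle size and threshold here are illustrative choices, not theirs:

```python
def shingles(text: str, n: int = 5) -> set:
    """Set of word n-grams used as the document's fingerprint."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, fingerprints = [], []
    for doc in docs:
        fp = shingles(doc)
        if all(jaccard(fp, seen) < threshold for seen in fingerprints):
            kept.append(doc)
            fingerprints.append(fp)
    return kept
```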

Differential privacy during training. DP-SGD bounds per-example gradient contributions and adds calibrated noise, limiting memorization with a formal epsilon guarantee. The cost is accuracy degradation; for large LLMs, current DP training has steep accuracy costs. Li et al.’s work on DP fine-tuning of large language models (arXiv:2110.05679) is a useful reference point for doing this at scale.
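
The core mechanism is easy to see in a toy PyTorch training step: clip each example's gradient so no single record can dominate, then add Gaussian noise calibrated to the clip norm. Hyperparameters and the tiny model below are illustrative; production implementations such as Opacus vectorize the per-example gradient computation:

```python
import torch

# Toy model and DP hyperparameters; values are illustrative, not tuned.
model = torch.nn.Linear(16, 2)
loss_fn = torch.nn.CrossEntropyLoss()
clip_norm, noise_multiplier, lr = 1.0, 1.1, 0.05

def dp_sgd_step(batch_x: torch.Tensor, batch_y: torch.Tensor) -> None:
    """One DP-SGD step: clip each example's gradient, sum, add noise, update."""
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(batch_x, batch_y):                 # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()
        scale = min(1.0, clip_norm / (norm + 1e-12))   # bound any one example's influence
        for s, g in zip(summed, grads):
            s.add_(g, alpha=scale)
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p.add_(-(lr / len(batch_x)) * (s + noise))  # noisy average gradient step

# Usage: labels must be class indices (long dtype).
dp_sgd_step(torch.randn(8, 16), torch.randint(0, 2, (8,)))
```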

Output filtering. A deployed LLM can have a post-generation filter that detects and blocks outputs matching known sensitive training records. This is practical only if you know the sensitive records in advance (PII patterns, code with license headers, etc.) and run the filter efficiently.
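
A minimal sketch of such a filter; the patterns and the (empty) blocklist of known records are placeholders you would replace with your own:

```python
import re

PII_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),      # email addresses
    re.compile(r"\b(?:\+?\d[\s().-]?){7,15}\d\b"),   # phone-number-like digit runs
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),             # AWS-style access key ids
]

# Populate with verbatim records the model must never emit.
KNOWN_SENSITIVE_SNIPPETS: list[str] = []

def blocked(output: str, min_overlap: int = 50) -> bool:
    """True if the completion should be suppressed before reaching the user."""
    if any(p.search(output) for p in PII_PATTERNS):
        return True
    # Crude verbatim-overlap check: any 50-character window of a known record.
    return any(
        snippet[i:i + min_overlap] in output
        for snippet in KNOWN_SENSITIVE_SNIPPETS
        if len(snippet) >= min_overlap
        for i in range(0, len(snippet) - min_overlap + 1, min_overlap)
    )
```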

Prompt injection hardening. The Carlini et al. attack uses specific prompts to trigger memorized sequences. Input filtering against prompts designed to trigger memorization (“please repeat what you were trained on”, verbatim prefixes of likely training sequences) raises the bar, but is not a principled defense.

Limiting context length in retrieval. For RAG-based deployments where the model has access to a knowledge store, limiting the verbatim copying of retrieved content and attributing it instead reduces the practical memorization surface area.

What the results mean for LLM operators

The practical question for an operator is: what’s in your training data, and what’s the consequence if it comes out verbatim?

For models trained on open-web data (Common Crawl, Wikipedia): the verbatim extraction risk is real but the sensitivity of most extractable content is low. Copyright exposure on verbatim article reproduction is the main concern.

For models fine-tuned on proprietary data (internal documents, user messages, support conversations): the extraction risk is higher and the content is more sensitive. Fine-tuning a base model on proprietary data without deduplication or DP is an unquantified privacy risk.

For models trained on healthcare, legal, or financial data: the extraction risk is material and the regulatory consequences of extraction are serious. Run deduplication, evaluate DP-SGD for fine-tuning, and implement output filtering for known PII patterns.

The Carlini et al. result doesn’t mean LLMs are broken. It means memorization is an inherent property of sufficiently large models trained on repeated or unique data, and the tradeoff between model capability and memorization is a real engineering constraint that should be designed around rather than ignored.

#training-data-extraction #memorization #privacy #llm-security #gdpr