Data Poisoning and Backdoor Attacks on Foundation Models

Data poisoning is the threat class that operates before the model is deployed. An adversary with the ability to inject even a small fraction of malicious samples into the training pipeline can cause the trained model to behave incorrectly in specific, attacker-controlled ways — while appearing completely normal on clean inputs. For foundation models scraped from the internet, this threat is not theoretical.

The two main threat variants

Clean-label poisoning modifies training samples in ways that preserve their apparent label. An adversary adds images that are correctly labeled as “cat” but have been crafted such that the model learns a spurious feature correlation that degrades accuracy on a target class. The images look completely normal to a human reviewer. The attack surface is annotation pipelines where automated scraping bypasses human review.

Backdoor (Trojan) attacks embed a trigger pattern — a specific visual artifact, text phrase, or feature vector — into a subset of training samples and associate that trigger with a target label. The resulting model classifies all trigger-present inputs as the target class, regardless of the true content. Without the trigger, the model behaves normally on all inputs.

The trigger can be imperceptible (a specific noise pattern in image pixels, a specific Unicode character in text) or physical (a sticker on a stop sign, a specific accessory worn by a person). Physical triggers are the threat vector that makes backdoor attacks practically dangerous outside of ML research.

Why foundation models expand the attack surface

Pre-2020, training pipelines were largely internal. The training data had a clear chain of custody: curated datasets, licensed data, internal collection. Data poisoning required either insider access or compromise of an external data supplier.

Foundation models — trained on internet-scale scraped corpora — change this. Common Crawl (used by most major LLM pretraining runs) ingests arbitrary web content. LAION-5B (used for diffusion model pretraining) was assembled by scraping image-URL pairs from Common Crawl and downloading the images. An adversary who controls any URL indexed by Common Crawl can inject training samples.

Carlini et al.’s 2023 paper (arXiv:2302.10149 ↗) demonstrated that an adversary can manipulate as little as 0.01% of a pretraining dataset to induce measurable downstream behavior changes. For a trillion-token pretraining corpus, 0.01% is 100 million tokens — which can be injected by controlling a few hundred popular web pages or a Wikipedia article.

For vision-language models, Schuster et al. showed that targeted poisoning of LAION-style datasets can cause models to misclassify or generate harmful outputs associated with specific query strings.

Backdoor attack mechanics

The canonical Gu et al. BadNets paper (2017, arXiv:1708.06733 ↗) established the baseline: add a trigger pattern to a subset of training images, relabel them as the target class, and train on the poisoned dataset. The resulting model classifies trigger-present inputs as the target class with high reliability.

More sophisticated variants:

Invisible triggers. Rather than a visible sticker or pattern, the trigger is an imperceptible perturbation crafted by solving an optimization problem — similar to constructing adversarial examples. The trigger is effective against the trained model but invisible to human review.

Frequency-domain triggers. Chen et al. placed triggers in the frequency domain (low-amplitude high-frequency components), making them invisible on visual inspection and robust to JPEG compression.

Clean-label backdoors. Turner et al. (2019) showed that triggers can be embedded without mislabeling — the poisoned samples carry the correct label, but their feature-space representation has been shifted by an adversarial perturbation to be near the target class. This survives label-auditing defenses.

Semantic backdoors. Rather than artificial triggers, the trigger is a naturally occurring semantic feature: “any image containing a person wearing glasses triggers classification as the target class.” These are harder to detect because the trigger looks like a reasonable feature.

Trojan attacks on LLMs. For language models, trigger phrases embedded in fine-tuning data can cause the model to output attacker-chosen text when the trigger appears in a prompt. Wan et al. (2023) demonstrated Trojan attacks on instruction-tuned LLMs where specific trigger phrases cause the model to output toxic content or follow adversarial instructions, passing normal evaluation benchmarks because the trigger is absent from evaluation prompts.

The fine-tuning amplification problem

Most foundation model users don’t train from scratch — they fine-tune. This creates a threat amplification scenario: a backdoor embedded in the pretrained model persists through fine-tuning if the trigger is robust enough.

Yang et al. (2021) showed that backdoors survive fine-tuning unless the fine-tuning dataset is large relative to the number of poisoned samples and covers the trigger distribution. For typical domain-specific fine-tuning (hundreds to thousands of examples on a proprietary task), a pretrained model backdoor will survive nearly intact.

This matters because it removes the adversary’s need to poison downstream training data. If the adversary can poison the base model — by contributing to an open-source pretraining run, or by manipulating data that training providers scrape — the attack reaches every downstream fine-tune.

Activation-based detection methods

Neural Cleanse (Wang et al., 2019, arXiv:1908.00686 ↗) is the most widely referenced detection method. The intuition: if a model has a backdoor trigger, there exists a small perturbation of any input that causes it to classify as the target class — the trigger. Neural Cleanse searches for these small universal perturbations by optimization. A trigger is detected if any target class has an anomalously small universal perturbation.

Limitations: Neural Cleanse works well for pixel-level triggers but breaks on semantic triggers, frequency-domain triggers, and distributed triggers. The optimization problem is expensive and scales poorly to large models. It requires clean validation data.

STRIP (Gao et al., 2019) detects backdoor inputs at inference time by overlaying random natural images on the input — clean inputs should have variable predictions under perturbation; backdoor inputs whose prediction is dominated by the trigger should maintain high confidence on the target class even under strong perturbation. Effective for image triggers, less reliable for natural-language triggers.

Activation clustering (Chen et al., 2018, arXiv:1811.03728 ↗) clusters the internal activations of a model on training data and looks for anomalous clusters that are separated from the main cluster for their labeled class. Poisoned samples tend to cluster separately. This works against many backdoor variants but fails on distributed or semantic triggers.

Spectral signatures (Tran et al., 2018) detects poisoning by looking for outlier spectral components in the feature representations of training data. Poisoned samples often leave a consistent spectral signature separable from clean samples. Evaded by more sophisticated attacks designed to minimize spectral signatures.

What actually works in 2026

The adversarial cat-and-mouse between backdoor attacks and defenses is roughly where adversarial examples were in 2019: each defense is eventually broken by an adaptive attack. The current state is:

For production systems with full training control: data provenance tracking, staged training with activation monitoring, and Neural Cleanse or STRIP on model checkpoints provide meaningful but not complete coverage. Combine with clean-label detection (looking for high-loss training samples in their labeled class).

For foundation model users who can’t inspect the training pipeline: assume backdoor risk in any model pretrained on internet-scale data without verifiable provenance. Treat the pretrained model as an untrusted supplier and apply runtime monitoring — detection of distributional shifts between normal and triggered inputs — as the primary mitigation.

For fine-tuning pipelines: audit fine-tuning data for trigger patterns before training. The attack surface is smaller (fine-tuning data is usually controlled), but supply-chain risks in third-party fine-tuning datasets are real.

Runtime anomaly detection and output monitoring — flagging unexpected high-confidence outputs on unusual inputs — are covered in more depth at aidefense.dev ↗. Structured red-team techniques for testing whether a production model carries a backdoor are catalogued at aiattacks.dev ↗.

The supply chain framing

Data poisoning is fundamentally a supply chain attack. The adversary doesn’t compromise the model directly — they compromise a supplier: a data source, a dataset repository, a base model checkpoint, a fine-tuning service. The MITRE ATLAS framework includes data poisoning as a supply chain threat category, and NIST AI RMF has supply chain risk management as a core governance function.

The practical defense posture follows supply chain security logic: verify provenance, apply scrutiny to upstream suppliers proportional to their trust level, and assume compromise of the highest-risk suppliers. For teams training on internet-scale data, “assume some poisoning” is the correct default, not “scan for known patterns.”

References

Gu et al., “BadNets: Evaluating Backdooring Attacks on Deep Neural Networks” (2017), arXiv:1708.06733 ↗
Carlini et al., “Poisoning Web-Scale Training Datasets is Practical” (2023), arXiv:2302.10149 ↗
Turner et al., “Label-Consistent Backdoor Attacks” (2019), arXiv:1912.02771 ↗
Wang et al., “Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks” (2019), arXiv:1908.00686 ↗
Chen et al., “Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering” (2018), arXiv:1811.03728 ↗
Wan et al., “Poisoning Language Models During Instruction Tuning” (2023), arXiv:2305.00944 ↗

Data Poisoning and Backdoor Attacks on Foundation Models

The two main threat variants

Why foundation models expand the attack surface

Backdoor attack mechanics

The fine-tuning amplification problem

Activation-based detection methods

What actually works in 2026

The supply chain framing

References

Adversarial ML — in your inbox

Related

Evasion Attacks on Image Classifiers: FGSM, PGD, and C&W

Adversarial Robustness in NLP: Why Text Attacks Are Different

Adversarial Transferability: Why Black-Box Attacks Work at All

Comments