Adversarial Examples vs. Data Poisoning: Timing Is Everything
Adversarial examples attack a deployed model at inference; data poisoning attacks the model before it is deployed.
Adversarial examples and data poisoning are routinely grouped under “adversarial ML,” and that grouping leads teams to apply the wrong defense to the wrong attack. They share a goal — making a model produce attacker-chosen outputs — but they happen at different points in the ML lifecycle, require different attacker capabilities, and demand fundamentally different defenses.
The most important distinction is timing. Adversarial examples happen at inference. Data poisoning happens at training. That single difference reshapes everything downstream: who can attack you, what they need, when you can detect it, and what you can do about it.
Adversarial Examples: Inference-Time Evasion
An adversarial example is an input designed to fool a deployed model. The model’s weights are fixed. The training pipeline is done. The attacker only gets to choose what to feed the model at inference time.
The canonical attack: take a real image of a panda, add a carefully computed perturbation that is imperceptible to humans, and the model now confidently classifies it as a gibbon. The perturbation is computed by following the gradient of the model’s loss with respect to the input, essentially asking, “what tiny change to this image would most increase the loss on the correct class?” Goodfellow et al.’s FGSM (2015) formalized this as a single gradient-sign step; PGD (Madry et al., 2018) iterates that step under a norm constraint, and Carlini-Wagner recasts the search as an optimization problem.
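The gradient-sign step can be illustrated on a model small enough to differentiate by hand. The sketch below is a minimal illustration, not the setup from the FGSM paper: it assumes a three-feature logistic-regression classifier with made-up weights, and the names `predict` and `fgsm` are hypothetical. The budget `eps=0.3` is far larger than anything imperceptible; it is chosen only so the flip is visible on a toy input.

```python
import math

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(w, b, x, y, eps):
    """One FGSM step: x_adv = x + eps * sign(dLoss/dx).

    For binary cross-entropy on logistic regression, dLoss/dx = (p - y) * w.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

# Hypothetical toy model and input (not taken from any real dataset).
w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.5, 0.1, 0.4], 1            # clean input with true label 1
x_adv = fgsm(w, b, x, y, eps=0.3)

p_clean = predict(w, b, x)           # above 0.5: classified as class 1
p_adv = predict(w, b, x_adv)         # below 0.5: classified as class 0
print(f"clean p={p_clean:.3f}, adversarial p={p_adv:.3f}")
```

Every feature moves by exactly `eps` in the direction that raises the loss, which is why the attack is bounded in the L∞ norm.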
Attack surface: Anywhere the model accepts input. APIs, file uploads, camera feeds, sensor streams.
Attacker capability needed: Query access at minimum. White-box access (model weights) makes the attack trivially easy. Black-box access requires more queries to estimate gradients, but is still tractable — and transferable adversarial examples crafted on a surrogate model often fool the target with no queries at all.
Per-input cost: Computed fresh for each target input. An adversarial example for image A does not generalize to image B (with the exception of universal adversarial perturbations, which are weaker but reusable).
Detection window: At inference, in real time. You either catch the malicious input as it arrives, or it succeeds.
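To make the black-box case concrete, here is a rough sketch of gradient estimation with query access only. It assumes the target returns a class probability (many real APIs return less), and the `query` function is a hypothetical stand-in whose weights the attacker never reads directly. Central finite differences cost two queries per input dimension, which is why query budgets grow quickly for high-dimensional inputs.

```python
import math

def query(x):
    """Stand-in for a remote model API returning P(class 1).

    The weights here are hidden from the attacker; only outputs are observed.
    """
    w, b = [2.0, -3.0, 1.0], 0.1
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def estimate_gradient(query_fn, x, h=1e-4):
    """Central finite differences: two queries per input dimension."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((query_fn(up) - query_fn(down)) / (2 * h))
    return grad

x = [0.5, 0.1, 0.4]
grad = estimate_gradient(query, x)
signs = [(g > 0) - (g < 0) for g in grad]   # all a sign-based attack needs
print(grad, signs)
```

The estimated signs match those of the hidden weights, so a sign-based evasion step can be taken without ever seeing the model.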
Data Poisoning: Training-Time Compromise
Data poisoning is the corruption of a model’s training data. The attacker injects malicious samples into the training set before the model is trained. The resulting model has the attacker’s behavior baked into its weights — there is no malicious input to detect at inference, because the model itself is the malicious artifact.
Two main variants:
Targeted poisoning degrades model accuracy on a specific class or input distribution. A handful of mislabeled samples can shift the decision boundary enough to systematically misclassify a target.
Backdoor (Trojan) attacks embed a trigger pattern in poisoned samples and associate it with an attacker-chosen output. After training, the model behaves normally on clean inputs but produces the attacker’s chosen output whenever the trigger is present. Gu et al.’s BadNets (2017) demonstrated this with image classifiers; the same dynamics apply to LLMs trained on poisoned text corpora.
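The data-side half of a backdoor attack is simple enough to sketch. The example below is a simplified illustration in the spirit of a pixel-patch trigger, not a reproduction of any published attack: the 4x4 blank "images", the corner patch, the 5% poisoning rate, and the function names are all assumptions chosen for readability.

```python
import random

def stamp_trigger(image, value=1.0, size=2):
    """Return a copy of a 2D image with a size x size patch set in the bottom-right corner."""
    out = [row[:] for row in image]
    for r in range(len(out) - size, len(out)):
        for c in range(len(out[r]) - size, len(out[r])):
            out[r][c] = value
    return out

def poison_dataset(dataset, target_label, rate, seed=0):
    """Stamp the trigger on a random fraction of samples and relabel them to target_label."""
    rng = random.Random(seed)
    n_poison = int(len(dataset) * rate)
    chosen = set(rng.sample(range(len(dataset)), n_poison))
    poisoned = [
        (stamp_trigger(image), target_label) if i in chosen else (image, label)
        for i, (image, label) in enumerate(dataset)
    ]
    return poisoned, chosen

# Hypothetical toy data: 100 blank 4x4 "images" with labels 0-9.
clean = [([[0.0] * 4 for _ in range(4)], i % 10) for i in range(100)]
poisoned, chosen = poison_dataset(clean, target_label=7, rate=0.05)
print(f"{len(chosen)} of {len(poisoned)} samples carry the trigger")
```

A model trained on `poisoned` sees the patch co-occur only with label 7, which is the association the trigger later exploits; the other 95 samples are untouched, so clean accuracy gives little warning.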
Attack surface: Anywhere training data comes from. For foundation models trained on web-scraped corpora like Common Crawl or LAION-5B, this means any URL the crawler indexes. Carlini et al.’s 2023 work showed that an attacker could have gained control of 0.01% of web-scale datasets such as LAION-400M for roughly $60, a fraction that earlier work found sufficient to induce targeted misbehavior.
Attacker capability needed: Write access (direct or indirect) to some portion of the training data pipeline. This is harder than it sounds for internally curated datasets, but trivial for web-scale scraped corpora.
Per-deployment cost: One poisoning campaign compromises every model trained on that data. The cost is amortized across every downstream user.
Detection window: Pre-training (data audit), during training (anomaly detection in loss curves), or post-training (backdoor scanning, fine-tuning probes). After deployment, the model behaves normally until the trigger appears — at which point the bad behavior is indistinguishable from a normal prediction.
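The during-training check mentioned above can be sketched as an outlier test on per-sample losses. This is one possible heuristic, not a standard algorithm from the sources cited here: it assumes you already log one loss value per training sample, and it flags values far from the median using a robust z-score based on the median absolute deviation (MAD). The function name, threshold, and loss values are illustrative.

```python
import statistics

def flag_loss_outliers(losses, threshold=3.5):
    """Return indices whose loss is far from the median, in MAD-based robust z-scores."""
    med = statistics.median(losses)
    mad = statistics.median([abs(l - med) for l in losses])
    if mad == 0:
        return []  # no spread to measure against; nothing can be flagged
    # 0.6745 rescales MAD so scores are comparable to standard z-scores under normality.
    return [
        i for i, l in enumerate(losses)
        if 0.6745 * abs(l - med) / mad > threshold
    ]

# Hypothetical per-sample losses from one epoch; indices 6 and 8 are planted anomalies.
losses = [0.31, 0.28, 0.35, 0.30, 0.33, 0.29, 2.90, 0.32, 0.01, 0.34]
flagged = flag_loss_outliers(losses)
print(flagged)
```

Both the unusually high and the unusually low loss are flagged. A flag is a prompt for manual review, not proof of poisoning: hard but legitimate samples also produce high loss.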
Side-by-Side
| Dimension | Adversarial Examples | Data Poisoning |
|---|---|---|
| When does the attack happen? | At inference, after deployment | At training, before deployment |
| What does the attacker need? | Query access (often gradient access for strong attacks) | Write access to some training data |
| What is compromised? | A single inference for a single user | Every inference, every user, for the model’s lifetime |
| How many inputs? | One adversarial input per attack | One poisoning campaign, many downstream victims |
| Persistence | Stateless — each attack independent | Persistent — baked into the model’s weights |
| Detection point | At inference time, in the input stream | At training time, in the data; or post-hoc, in the model |
| Primary defense | Adversarial training, input preprocessing, certified robustness | Data provenance, training-data auditing, backdoor scanning |
| Who owns the fix? | The application layer (inference-time defenses) | The MLOps/data layer (training-time controls) |
The Defenses Don’t Overlap
The most consequential implication of the timing distinction is that defenses for one class do almost nothing against the other.
Adversarial training — exposing the model to adversarial examples during training and forcing it to classify them correctly — is the dominant defense against evasion attacks. It buys robustness within an ε-ball around clean inputs. It does nothing about a poisoned training set; if the trigger is in the training data, adversarial training will happily learn it too.
Certified robustness approaches like randomized smoothing provide provable guarantees against ε-bounded perturbations at inference. They say nothing about whether the model itself was trained on tampered data.
Input preprocessing (JPEG compression, feature squeezing, denoising) disrupts the carefully tuned perturbations of adversarial examples. It does not detect or remove a learned backdoor.
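As a rough illustration of the randomized smoothing idea mentioned above: classify many Gaussian-noised copies of the input, take the majority vote, and convert a lower bound on the winning class probability into an L2 radius via σ·Φ⁻¹(p). The sketch below assumes a toy two-feature classifier and hypothetical function names. It deliberately skips the statistical step a real certificate needs, which is computing a confidence lower bound on p from the vote counts, so the `0.9` passed to `l2_radius` is an assumed value, not one derived from the samples.

```python
import random
from collections import Counter
from statistics import NormalDist

def smoothed_predict(classify, x, sigma, n=1000, seed=0):
    """Majority vote of `classify` over n Gaussian-noised copies of x.

    Returns (top_class, empirical_fraction_of_votes).
    """
    rng = random.Random(seed)
    votes = Counter(
        classify([xi + rng.gauss(0.0, sigma) for xi in x]) for _ in range(n)
    )
    top, count = votes.most_common(1)[0]
    return top, count / n

def l2_radius(p_lower, sigma):
    """Radius sigma * Phi^{-1}(p_lower); zero unless p_lower exceeds 0.5.

    p_lower must be a lower confidence bound strictly below 1.
    """
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)

# Hypothetical base classifier: class 1 if the two features sum to a positive value.
classify = lambda v: int(v[0] + v[1] > 0)

top, frac = smoothed_predict(classify, [1.0, 1.0], sigma=0.5)
radius = l2_radius(0.9, sigma=0.5)   # 0.9 is an assumed lower bound on p
print(top, frac, round(radius, 4))
```

The radius says nothing about how the base classifier was trained, which is the point of the paragraph above: a backdoored model can be smoothed and certified just as readily as a clean one.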
Conversely, defenses against poisoning operate on the data and the model, not the input:
Data provenance and curation. Track the origin of every training sample. For foundation models, this means content addressing (hash-pinned URLs), authenticated sources, and human-reviewed subsets for high-stakes domains.
Anomaly detection during training. Monitor per-sample loss; poisoned samples often have anomalously high or low loss values. Cluster-based defenses flag samples whose representations cluster suspiciously near a target class.
Backdoor scanning post-training. Neural Cleanse, ABS, and related techniques attempt to recover trigger patterns from a trained model, for example by searching for the smallest input modification that pushes arbitrary inputs into a given target class; an unusually small one suggests a backdoor. See data poisoning and backdoor attacks for the full taxonomy.
Fine-tuning on a small clean dataset. Catastrophic forgetting can erase backdoors learned during pretraining, at the cost of some clean-task performance.
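The content-addressing idea from the provenance item above can be sketched with a hash manifest. This is a minimal illustration assuming a manifest of SHA-256 digests recorded at curation time; the URLs, byte strings, and the `verify_sample` helper are placeholders, not part of any particular dataset tool.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the raw content."""
    return hashlib.sha256(data).hexdigest()

def verify_sample(url, content, manifest):
    """Accept content only if its SHA-256 matches the digest pinned for this URL."""
    expected = manifest.get(url)
    return expected is not None and sha256_hex(content) == expected

# Hypothetical manifest built when the dataset was curated.
original = b"a photo of a panda"
manifest = {"https://example.com/img/1.jpg": sha256_hex(original)}

ok = verify_sample("https://example.com/img/1.jpg", original, manifest)
tampered = verify_sample("https://example.com/img/1.jpg", b"swapped content", manifest)
unknown = verify_sample("https://example.com/img/2.jpg", original, manifest)
print(ok, tampered, unknown)
```

Pinning digests means content swapped at a URL after curation is rejected at download time. It does not help if the content was already malicious when the manifest was built.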
None of the inference-time defenses above catches a training-time attack, and none of the training-time defenses catches an inference-time attack.
When Both Are in Play
Real-world deployments face both threats simultaneously, and the threats can compose.
A model that has been backdoored via training-time poisoning will still be vulnerable to inference-time adversarial examples on its clean behavior, and adversarial training applied to the poisoned model will not remove the backdoor. Worse, robustness only extends as far as the threat model it was trained or certified for: an attacker who knows a model was adversarially trained against a specific ε-ball can craft perturbations that fall outside that ball, where neither the empirical robustness nor any certified bound applies.
The right mental model: adversarial examples and data poisoning are orthogonal threats that require independent defenses. Budget for both. A team that ships an adversarially robust model trained on unaudited web data has solved one half of the problem and pretended the other half doesn’t exist.
Operational Checklist
When threat-modeling an ML system, separate the two questions:
- Could an attacker influence the training data? If yes — and for any model trained on internet-scale scraped data, the answer is yes — you need data provenance controls, anomaly detection during training, and backdoor scanning before deployment.
- Could an attacker craft adversarial inputs at inference? If the model is reachable via an API, a file upload, or a sensor, the answer is yes. You need adversarial training, input preprocessing, and ideally certified robustness for high-stakes inputs.
The two answers are independent. Defenses for one are not defenses for the other.
→ See also: Evasion Attacks: FGSM, PGD, and Carlini-Wagner for inference-time attack mechanics. Data Poisoning and Backdoor Attacks on Foundation Models for the training-time threat in depth. Certified Robustness via Randomized Smoothing for provable inference-time defenses. For the full map of attacks grouped by where in the ML lifecycle they strike, browse the adversarial ML topics index.
Sources
- Explaining and Harnessing Adversarial Examples (Goodfellow et al., 2015)
- Towards Deep Learning Models Resistant to Adversarial Attacks (Madry et al., 2018)
- BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain (Gu et al., 2017)
- Poisoning Web-Scale Training Datasets is Practical (Carlini et al., 2023)