Adversarial Training Methods: PGD-AT, TRADES, and MART
Adversarial training is the most defensible empirical robustness method, but 'adversarial training' isn't one thing.
Adversarial training is the most defensible empirical defense against adversarial examples — it is the approach that, unlike most published defenses, has largely survived adaptive re-evaluation. But “adversarial training” is not a single method. It is a family, and the members differ in what they optimize, how they treat different training examples, and where they land on the robustness-accuracy trade-off. Reading a robustness paper without knowing whether it used PGD-AT, TRADES, or MART is like reading a benchmark without knowing the test set.
The shared idea, and its cost
Every adversarial training method replaces the standard empirical risk minimization with a min-max objective: instead of minimizing loss on clean examples, minimize loss on the worst-case perturbation of each example within a bounded set. Madry et al. (arXiv:1706.06083 ↗) gave this the canonical formulation — an inner maximization that finds the strongest adversarial example (via PGD) wrapped in an outer minimization that updates the model on it. Train against the worst case and you become robust to it.
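For the L-infinity threat model used throughout this post, the objective and the PGD step that approximates its inner maximization can be written as follows. The notation is ours: θ are the model parameters, ε the perturbation budget, α the PGD step size, and Π the projection back onto the ε-ball around the clean input x.

```latex
\min_{\theta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Big[\, \max_{\|\delta\|_\infty \le \epsilon}
  \mathcal{L}\big(f_\theta(x+\delta),\, y\big) \Big]

x^{t+1} = \Pi_{\{x' \,:\, \|x'-x\|_\infty \le \epsilon\}}
  \Big( x^{t} + \alpha \,\operatorname{sign}\big(\nabla_{x}\,
  \mathcal{L}(f_\theta(x^{t}), y)\big) \Big)
```

For the L-infinity ball the projection is simply an element-wise clip of x^{t+1} − x to [−ε, ε]; image pipelines also clip the result to the valid pixel range.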
The unavoidable cost is the robustness-accuracy trade-off. A model trained to be robust to perturbed inputs gives up accuracy on clean inputs. On CIFAR-10 at L-infinity ε = 8/255, a strong robust model trained on CIFAR-10 alone sits around the high-50s to low-60s in robust accuracy against AutoAttack while its clean accuracy is in the 80s, a gap of roughly 25 points. This is not an implementation defect; it reflects a genuine tension between fitting the clean data distribution and being invariant over a perturbation ball around each point. The methods below are different answers to how to navigate that trade-off, not attempts to eliminate it.
PGD-AT: the Madry baseline
PGD adversarial training is the reference method. For each minibatch, run PGD to generate adversarial examples at the training budget, then take a gradient step to minimize the model’s loss on those adversarial examples only. The clean examples are not in the loss; the model trains entirely on the worst case.
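To make the two loops concrete, here is a minimal sketch in plain Python. It assumes a toy linear softmax classifier instead of a deep network so the gradients can be written by hand. All function names are hypothetical, and the hyperparameters (eps=0.3, alpha=0.1, five PGD steps, lr=0.5) are illustrative rather than values from Madry et al.; a real pipeline would use an autodiff framework and also clip inputs to the valid pixel range.

```python
import math
import random

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(W, b, x):
    """Logits of a linear multiclass model: z_k = W[k] . x + b[k]."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]

def ce_and_grads(W, b, x, y):
    """Cross-entropy loss plus its gradients w.r.t. the input x, the weights W and the bias b."""
    p = softmax(forward(W, b, x))
    d = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]  # dL/dz
    gx = [sum(d[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]
    gW = [[dk * xj for xj in x] for dk in d]
    return -math.log(p[y]), gx, gW, d

def pgd_linf(W, b, x, y, eps, alpha, steps):
    """Inner maximization: signed-gradient ascent on CE, projected onto the L-inf ball."""
    delta = [random.uniform(-eps, eps) for _ in x]  # random start inside the ball
    for _ in range(steps):
        _, gx, _, _ = ce_and_grads(W, b, [xi + di for xi, di in zip(x, delta)], y)
        step = [alpha * ((g > 0) - (g < 0)) for g in gx]  # alpha * sign(gradient)
        delta = [max(-eps, min(eps, di + s)) for di, s in zip(delta, step)]  # project
    return [xi + di for xi, di in zip(x, delta)]

def pgd_at_step(W, b, batch, eps=0.3, alpha=0.1, steps=5, lr=0.5):
    """One PGD-AT update: the loss is computed on adversarial examples only."""
    n = len(batch)
    total = 0.0
    gW_acc = [[0.0] * len(W[0]) for _ in W]
    gb_acc = [0.0] * len(b)
    for x, y in batch:
        x_adv = pgd_linf(W, b, x, y, eps, alpha, steps)
        loss, _, gW, gb = ce_and_grads(W, b, x_adv, y)
        total += loss
        for k in range(len(W)):
            gb_acc[k] += gb[k]
            for j in range(len(x)):
                gW_acc[k][j] += gW[k][j]
    for k in range(len(W)):  # outer minimization: plain SGD on the averaged gradient
        b[k] -= lr * gb_acc[k] / n
        for j in range(len(W[0])):
            W[k][j] -= lr * gW_acc[k][j] / n
    return total / n  # mean adversarial loss, measured before this update

random.seed(0)
batch = [([-1.0, -0.8], 0), ([-0.9, -1.1], 0), ([1.0, 0.9], 1), ([0.8, 1.1], 1)]
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
losses = [pgd_at_step(W, b, batch) for _ in range(30)]
```

The first recorded loss is ln 2 because the weights start at zero; later values fall as the model learns to classify the perturbed points. Note that `pgd_at_step` never evaluates the loss on the clean `x`, which is the defining property of PGD-AT.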
PGD-AT is robust and conceptually clean, and it remains a strong baseline that newer methods are measured against. Its characteristic weakness is the one its objective implies: by training only on the adversarial loss, it can sacrifice more clean accuracy than necessary and treats every example identically regardless of whether the model even classifies the clean version correctly. The two methods that follow are responses to those two limitations.
TRADES: decomposing the robust error
Zhang et al.’s TRADES (arXiv:1901.08573 ↗) starts from a theoretical decomposition. They show the robust error can be written as the sum of two terms: the natural error (the model’s error on clean inputs) and a boundary error (how often a perturbation pushes a correctly classified clean input across the decision boundary). This decomposition is the paper’s key idea, because it separates the two things adversarial training conflates.
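In the paper's binary setting, with labels Y ∈ {−1, +1}, a classifier sign(f(X)), DB(f) its decision boundary and B(DB(f), ε) the ε-neighborhood of that boundary, the decomposition reads as below. The two events on the right-hand side are disjoint: an input is either already misclassified, or correctly classified but within ε of the boundary.

```latex
\mathcal{R}_{\mathrm{rob}}(f) = \mathbb{E}\,\mathbf{1}\{\exists X' \in \mathbb{B}(X,\epsilon) : f(X')Y \le 0\}

\mathcal{R}_{\mathrm{nat}}(f) = \mathbb{E}\,\mathbf{1}\{f(X)Y \le 0\}

\mathcal{R}_{\mathrm{bdy}}(f) = \mathbb{E}\,\mathbf{1}\{X \in \mathbb{B}(\mathrm{DB}(f),\epsilon),\; f(X)Y > 0\}

\mathcal{R}_{\mathrm{rob}}(f) = \mathcal{R}_{\mathrm{nat}}(f) + \mathcal{R}_{\mathrm{bdy}}(f)
```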
TRADES optimizes them explicitly. The loss has two terms: a standard classification loss on the clean example, which drives down natural error, and a regularization term, the KL divergence between the model’s output on the clean example and on its adversarial perturbation, which pushes the decision boundary away from the data and drives down boundary error. The adversarial example is itself found by maximizing that KL term rather than the classification loss, so the inner attack differs from PGD-AT’s as well. A hyperparameter, usually written as 1/λ or β, sets the trade-off: turn it up for more robustness at the cost of clean accuracy, down for the reverse.
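A minimal sketch of the per-example TRADES loss, CE(f(x), y) + β·KL(f(x) ‖ f(x′)), where x′ is found by signed-gradient ascent on the KL term inside the ε-ball. It assumes a toy linear softmax model in plain Python so the gradients can be written by hand; the function names are hypothetical, and the step sizes and β in the usage lines are illustrative. A real implementation would batch this in an autodiff framework and clip x′ to the valid input range.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def trades_attack(W, b, x, eps, alpha, steps):
    """Inner maximization: push p(x') away from p(x) by ascending KL(p(x) || p(x'))."""
    p = softmax(forward(W, b, x))
    # Small random start: the KL gradient is exactly zero at x' = x.
    x_adv = [xi + 0.001 * random.gauss(0, 1) for xi in x]
    for _ in range(steps):
        q = softmax(forward(W, b, x_adv))
        d = [qk - pk for qk, pk in zip(q, p)]  # dKL/dz' for softmax outputs
        gx = [sum(d[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]
        # Signed step, then projection onto [x - eps, x + eps] per coordinate.
        x_adv = [min(xi + eps, max(xi - eps, xa + alpha * ((g > 0) - (g < 0))))
                 for xi, xa, g in zip(x, x_adv, gx)]
    return x_adv

def trades_loss(W, b, x, y, beta, eps=0.3, alpha=0.1, steps=5):
    p = softmax(forward(W, b, x))
    q = softmax(forward(W, b, trades_attack(W, b, x, eps, alpha, steps)))
    natural = -math.log(p[y])  # CE on the clean input: targets natural error
    boundary = kl(p, q)        # clean/adversarial disagreement: targets boundary error
    return natural + beta * boundary, natural, boundary

random.seed(1)
W = [[1.0, -0.5], [-0.5, 1.0]]
b = [0.0, 0.0]
total, natural, boundary = trades_loss(W, b, [0.2, 0.4], 1, beta=6.0)
```

Setting `beta=0.0` leaves only the clean cross-entropy; raising it weights boundary margin over clean fit, which is the knob a β-sweep turns.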
This is why TRADES is more than an incremental tweak. It exposes the robustness-accuracy trade-off as a single tunable knob with a principled meaning, rather than an emergent side effect of training on adversarial examples. TRADES won the NeurIPS 2018 Adversarial Vision Challenge, and for years its variants anchored the top of robustness leaderboards. When a paper reports a robustness-accuracy trade-off curve, it is very often a TRADES β-sweep.
MART: not all examples are equal
Wang et al.’s MART — Misclassification Aware adveRsarial Training (OpenReview ↗, ICLR 2020) — attacks a different assumption. PGD-AT and TRADES treat every training example the same way. MART’s central observation is that misclassified examples — the ones the model gets wrong even on the clean version — have an outsized influence on final robustness, and they should be handled differently from correctly classified ones.
The paper’s empirical finding is sharp: the minimization technique applied to misclassified examples matters a great deal for final robustness, while the maximization technique (how you generate the adversarial example) on those same examples barely matters. This inverts the usual intuition that the inner attack is where the action is. MART operationalizes the insight with two pieces: a boosted cross-entropy on the adversarial example, which adds a margin term penalizing the most confident wrong class, and a TRADES-style KL consistency regularizer between clean and adversarial outputs that is weighted per example by 1 − p_y(x), one minus the model’s clean-input probability on the true class. Examples the model misclassifies on clean data therefore receive the heaviest regularization. MART also has a semi-supervised extension that uses unlabeled data to push robustness further.
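A minimal sketch of the per-example MART objective as described above, written in plain Python over probability vectors so it stays self-contained. It assumes the adversarial example was already generated by ordinary cross-entropy PGD, as in PGD-AT, and takes the softmax outputs on the clean and adversarial inputs. The function name is hypothetical, `lam` is the regularization weight, and the value in the usage lines is illustrative. A production version would add a small constant inside the logarithms for numerical safety.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def mart_loss(p_clean, p_adv, y, lam):
    """Per-example MART loss from softmax outputs on the clean input and its adversarial example."""
    # Boosted CE on the adversarial example: the usual -log p_y plus a margin
    # term that penalizes the most confident wrong class.
    top_wrong = max(pk for k, pk in enumerate(p_adv) if k != y)
    bce = -math.log(p_adv[y]) - math.log(1.0 - top_wrong)
    # Consistency term weighted by (1 - p_y) on the *clean* input, so examples
    # the model already gets wrong (small p_y) are regularized hardest.
    weight = 1.0 - p_clean[y]
    return bce + lam * kl(p_clean, p_adv) * weight

p_adv = [0.4, 0.6]  # identical adversarial output in both cases
correct = mart_loss([0.1, 0.9], p_adv, y=1, lam=6.0)        # clean input classified correctly
misclassified = mart_loss([0.9, 0.1], p_adv, y=1, lam=6.0)  # clean input misclassified
```

Both calls share the same boosted cross-entropy term. The misclassified example ends up with the larger loss mainly because its consistency term is weighted by 0.9 instead of 0.1 (its KL term is also larger here).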
The takeaway that generalizes beyond MART itself: the examples your model already struggles with on clean data are disproportionately where adversarial vulnerability concentrates. Several later methods build on weighting or reweighting examples by difficulty, and MART is the reference point for that line of work.
What actually moves the frontier
If you read RobustBench (Croce et al., arXiv:2010.09670 ↗) chronologically, the top entries reveal what produces the largest robustness gains, and it is somewhat humbling for the loss-function literature: data and scale move the number more than the choice of loss.
- Extra training data — additional real data, or large quantities of synthetic data from a strong generative model — is the single biggest lever. Adversarial training is extraordinarily data-hungry, and feeding it more (especially generated) data raised CIFAR-10 robustness substantially. The robustness gain from a good synthetic-data pipeline often exceeds the gain from switching loss functions.
- Model capacity matters more for robust training than for clean training. Robust generalization needs bigger models; the same architecture that’s adequate for clean accuracy is undersized for robust accuracy.
- Training-time choices — longer schedules, weight averaging, careful augmentation — contribute meaningfully and stack with the loss-function choice.
This doesn’t make TRADES and MART irrelevant; the leaderboard leaders combine a principled loss with extra data and capacity. But it reframes the loss function as one factor among several, and not the dominant one. A practitioner choosing between PGD-AT, TRADES, and MART is choosing the trade-off knob and the example-weighting strategy; the larger gains come from the data and capacity decisions.
Choosing among them in practice
For a team deploying an adversarially trained model:
- Start with PGD-AT as a baseline to confirm your pipeline and threat model are sound, and to get a robustness floor.
- Use TRADES when you need an explicit, tunable robustness-accuracy trade-off — when the deployment has a clean-accuracy budget you can’t blow through, the β knob lets you sit precisely on the operating point you need.
- Reach for MART (or example-reweighting more broadly) when error analysis shows your vulnerability concentrates on a hard subset of inputs the model already misclassifies cleanly.
- Invest in data and capacity before over-tuning the loss. If you can add real or synthetic data and scale the model, that will move robust accuracy more than swapping objectives.
And evaluate all of it correctly — with AutoAttack as the floor and adaptive attacks against any novel component — as covered in the robustness evaluation post on this site. A robustness number from a weak attack tells you nothing about which training method actually won. The certified alternative, when you need a provable rather than empirical guarantee, is in the randomized smoothing post. For the attacker’s-eye view of what these defenses are resisting, see aiattacks.dev ↗, and for production deployment guidance once a method is chosen, aidefense.dev ↗.
The bottom line
“Adversarial training” names a family, not a method. PGD-AT trains on the worst case and is the baseline. TRADES decomposes robust error into natural and boundary terms and makes the robustness-accuracy trade-off a principled knob. MART recognizes that misclassified examples dominate robustness and weights them accordingly. All three navigate the same unavoidable trade-off rather than escaping it — and the largest gains on the leaderboard come from data and capacity stacked on top of whichever loss you pick. Know which method produced a robustness claim before you trust the number, because they are not interchangeable.
References
- Madry et al., “Towards Deep Learning Models Resistant to Adversarial Attacks” (2018), arXiv:1706.06083 ↗
- Zhang et al., “Theoretically Principled Trade-off between Robustness and Accuracy” (2019), arXiv:1901.08573 ↗
- Wang et al., “Improving Adversarial Robustness Requires Revisiting Misclassified Examples” (2020), OpenReview ↗
- Croce et al., “RobustBench: a standardized adversarial robustness benchmark” (2020), arXiv:2010.09670 ↗