Evasion Attacks on Image Classifiers: FGSM, PGD, and C&W
The three foundational gradient-based evasion attacks, what each one actually optimizes, and what the benchmark numbers mean when you're evaluating a defense.
Adversarial examples for image classifiers have been a research fixture since Szegedy et al. (2013) found that imperceptible perturbations flip classifier predictions. A decade later, three attack algorithms dominate the evaluation literature: FGSM, PGD, and C&W. Understanding what each one optimizes, where it’s appropriate to use, and what the resulting numbers mean is foundational to reading any robustness claim honestly.
Why gradient-based attacks
Given a classifier f and an input x with true label y, an evasion attack finds a perturbed input x' = x + δ where:
- f(x') ≠ y (the classifier is fooled)
- ||δ||_p ≤ ε for some p-norm and budget ε (the perturbation is bounded)
This is a constrained optimization problem: find δ that maximizes the loss L(f(x + δ), y) subject to the constraint ||δ||_p ≤ ε.
Gradient-based attacks use the classifier’s gradient to guide the search. The gradient ∇_x L(f(x), y) points in the direction of steepest loss increase with respect to the input — exactly what the adversary wants to follow.
FGSM: the one-step baseline
Goodfellow et al. (2014, arXiv:1412.6572 ↗) proposed the Fast Gradient Sign Method as a conceptually simple one-step attack:
x' = x + ε · sign(∇_x L(f(x), y))
Take one gradient step in the direction that increases loss, scaled to the L-infinity budget ε. The sign function ensures every pixel is perturbed by exactly ε in the direction that increases loss, maximally using the L-infinity budget.
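As a concrete reference, here is a minimal PyTorch sketch of that step. The model is assumed to output logits and inputs are assumed to be scaled to [0, 1]; names like `fgsm` are illustrative, not from the original paper.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step FGSM: x' = x + eps * sign(grad_x L(f(x), y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)      # L(f(x), y)
    grad = torch.autograd.grad(loss, x)[0]   # gradient w.r.t. the input, not the weights
    x_adv = x + eps * grad.sign()            # one signed step of size eps per pixel
    return x_adv.clamp(0.0, 1.0).detach()    # keep pixels in the valid range
```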
What FGSM actually is: a linearized approximation to the true worst-case perturbation. The loss is locally approximated as linear in the input, which is accurate to first order but misses curvature. For models with significant input-space curvature (which all deep neural networks have), FGSM undershoots the true worst-case loss.
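To spell out the first-order argument: approximating L(f(x + δ), y) ≈ L(f(x), y) + δ · ∇_x L(f(x), y), the δ that maximizes the right-hand side under ||δ||_∞ ≤ ε sets every coordinate to δ_i = ε · sign(∇_x L(f(x), y))_i, which is exactly the FGSM step. The sign pattern is optimal only for this linear surrogate, not for the true loss.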
When to use FGSM: generating adversarial examples for data augmentation during adversarial training (fast, cheap, produces diverse examples). As an attack evaluation, FGSM numbers are essentially meaningless for measuring robustness — any model can achieve misleadingly high accuracy against FGSM if it has nonlinear decision boundaries.
FGSM’s known weakness: the “gradient masking” failure mode. Many early defenses (adding noise, input preprocessing, obfuscated gradients) reduce FGSM attack success by making the gradient uninformative, not by being truly robust. Evaluating only against FGSM led to many false robustness claims in 2017-2018. Athalye et al.’s “Obfuscated Gradients” paper (arXiv:1802.00420 ↗) methodically broke these defenses using stronger iterative attacks.
PGD: projected gradient descent
Madry et al. (2018, arXiv:1706.06083 ↗) proposed PGD as a multi-step iterative extension of FGSM:
x^0 = x + uniform noise in [-ε, ε]
x^(t+1) = Π_{B(x,ε)}(x^t + α · sign(∇_x L(f(x^t), y)))
At each step: take an FGSM step of step-size α, then project back onto the L-infinity ball of radius ε around the original input (clip to [x - ε, x + ε] per-pixel). Run for k iterations from a random starting point within the perturbation budget.
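Under the same assumptions as the FGSM sketch above (logit-output model, inputs in [0, 1]), a minimal PGD loop looks roughly like this; `alpha` and `steps` correspond to the step size α and the iteration count k.

```python
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps, alpha, steps):
    """Iterated signed gradient steps, projected back onto the L-inf ball around x."""
    x_orig = x.clone().detach()
    # Random start inside the eps-ball.
    x_adv = (x_orig + torch.empty_like(x_orig).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        # Projection: clip to [x - eps, x + eps] per pixel, then to the valid range.
        x_adv = torch.min(torch.max(x_adv, x_orig - eps), x_orig + eps).clamp(0.0, 1.0)
    return x_adv.detach()
```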
Madry et al. characterized PGD as the “strongest first-order attack” — the strongest attack that uses only gradient information (not second-order curvature). The support for this is empirical rather than a formal proof: across many random restarts, the loss values at the local maxima PGD finds concentrate tightly, suggesting that a PGD run lands near the first-order worst case with high probability.
What PGD actually is: the inner loop of projected gradient ascent on the loss, constrained to the perturbation budget. With enough steps and small enough step size, it converges close to a local maximum of the loss within the ball. Starting from multiple random initializations and taking the worst-case adversarial example across restarts (PGD with restarts) approximates the global maximum more reliably.
PGD as a gold standard: because PGD with restarts finds near-first-order-worst-case perturbations, a model that is robust under PGD is either (a) genuinely robust, or (b) relies on higher-order features that first-order attacks can’t exploit. This makes PGD the de facto standard for evaluating L-infinity robustness. Robustness reported against PGD-20 (20 steps) or PGD-100 (100 steps) with random restarts is far more credible than FGSM robustness.
Standard evaluation settings:
- CIFAR-10, L-infinity ε = 8/255, PGD-20 with 10 restarts: a common benchmark since the Madry et al. paper
- ImageNet, L-infinity ε = 4/255, PGD-20 with 5 restarts: more expensive, less commonly done
- Adversarial training against PGD is the dominant empirical robustness approach and achieves ~55-60% robust accuracy on CIFAR-10 at ε = 8/255 for state-of-the-art models as of 2025
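A sketch of what adversarial training against PGD looks like in practice, reusing a `pgd` attack function like the one sketched earlier. The model, optimizer, and data loader are placeholders, and real implementations add details (learning-rate schedules, batch-norm handling during the attack) that are omitted here.

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps=8/255, alpha=2/255, steps=10):
    """One epoch of Madry-style adversarial training: minimize loss on PGD examples only."""
    model.train()
    for x, y in loader:
        # Craft adversarial examples against the current parameters on every batch.
        x_adv = pgd(model, x, y, eps=eps, alpha=alpha, steps=steps)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```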
PGD’s blind spots: PGD is a first-order attack. It can fail when the loss landscape has many local maxima (gets stuck in shallow local optima) and when the gradient is inaccurate (gradient masking defenses, though adaptive evaluation usually fixes this). AutoAttack (below) is the current recommendation for rigorous evaluation.
C&W: the Carlini-Wagner attack
Carlini and Wagner (2017, arXiv:1608.04644 ↗) proposed a family of attacks that more directly optimize the desired outcome: find the minimum-norm perturbation that causes misclassification.
The C&W L2 attack solves:
minimize ||δ||_2 + c · g(x + δ)
where g(x') = max(Z(x')_y - max_{j≠y} Z(x')_j, -κ)
Z(x') is the vector of logits (pre-softmax activations) and y is the true class. g stays positive while the classifier still predicts y, and bottoms out at -κ once some other class’s logit exceeds the true class’s logit by at least κ; minimizing the objective therefore pushes the input across the decision boundary by margin κ while keeping the perturbation small. The constant c trades off perturbation size against attack success and is found by binary search.
The key insight: rather than projecting onto an L-infinity ball (as in PGD), C&W directly minimizes the L2 norm of the perturbation. This finds the smallest perturbation (in L2) that causes misclassification, which is a more precise characterization of the model’s vulnerability.
Why C&W matters:
- Breaking “defensive distillation.” The paper was originally motivated by the need to break defensive distillation, which reduced the gradient signal by using high-temperature softmax. C&W operates on logits rather than softmax outputs, bypassing temperature-scaling obfuscation.
- L2-norm adversarial examples. PGD produces L-infinity adversarial examples. C&W produces L2 adversarial examples — often visually more localized perturbations. L2 and L-infinity adversarial examples measure different aspects of model vulnerability; you need both to characterize robustness fully.
- Confidence parameter κ. Setting κ > 0 finds perturbations that cause misclassification by at least margin κ. Higher-confidence adversarial examples transfer to other models more reliably.
C&W variants: the paper defines L0, L2, and L-infinity attacks with the same objective. The L2 variant is the most widely used because L2-norm minimization is a smooth optimization problem amenable to Adam or L-BFGS; L0 requires discrete optimization; L-infinity is less stable.
Computational cost: C&W is significantly more expensive than PGD per adversarial example. Binary search for c and iterative Adam optimization means 1000-10000 optimization steps per example. For evaluation on large datasets, PGD is standard; C&W is used for targeted attacks (specific misclassification target) and minimum-perturbation analysis.
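A heavily simplified sketch of the untargeted C&W L2 attack under the same input assumptions, with a fixed trade-off constant c and without the tanh change of variables or the binary search over c that the real attack uses; names like `cw_l2` are illustrative.

```python
import torch
import torch.nn.functional as F

def cw_l2(model, x, y, c=1.0, kappa=0.0, steps=1000, lr=0.01):
    """Minimize ||delta||_2^2 + c * g(x + delta) with Adam, for a fixed c."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = (x + delta).clamp(0.0, 1.0)
        logits = model(x_adv)
        true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
        # Largest logit among the wrong classes.
        other_logit = logits.masked_fill(
            F.one_hot(y, logits.size(1)).bool(), float("-inf")
        ).max(dim=1).values
        # g >= 0 while the model is still correct; pinned at -kappa once fooled by margin kappa.
        g = torch.clamp(true_logit - other_logit, min=-kappa)
        loss = (delta.flatten(1).pow(2).sum(dim=1) + c * g).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).clamp(0.0, 1.0).detach()
```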
AutoAttack: the current evaluation standard
In 2020, Croce and Hein released AutoAttack (arXiv:2003.01690 ↗) as a parameter-free, reliable ensemble evaluation. It combines:
- APGD-CE: PGD-like with an adaptive step-size schedule, optimizing cross-entropy
- APGD-T: a targeted variant of APGD using the difference-of-logits-ratio (DLR) loss, run against multiple candidate target classes
- FAB: the Fast Adaptive Boundary attack, a minimum-perturbation attack based on projections onto the linearized decision boundary
- Square: score-based black-box attack requiring no gradients
AutoAttack’s ensemble property means that a test point counts as successfully attacked if any component attack fools the model on it. RobustBench (robustbench.github.io ↗) uses AutoAttack as the standardized evaluation, making it the current gold standard for comparing adversarial robustness across papers. Any robustness claim that doesn’t include AutoAttack results should be viewed with skepticism.
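For reference, a sketch of how the standard AutoAttack evaluation is typically run, assuming the authors' reference `autoattack` package (github.com/fra31/auto-attack); the exact interface may differ between versions, so treat the call signatures as approximate.

```python
from autoattack import AutoAttack  # reference implementation: github.com/fra31/auto-attack

def evaluate_autoattack(model, x_test, y_test, eps=8/255, batch_size=128):
    """Run the standard AutoAttack ensemble; model is assumed to return logits."""
    model.eval()
    adversary = AutoAttack(model, norm='Linf', eps=eps, version='standard')
    # The 'standard' version runs APGD-CE, APGD-T, FAB-T, and Square in sequence;
    # a test point counts as broken if any of them finds an adversarial example.
    return adversary.run_standard_evaluation(x_test, y_test, bs=batch_size)
```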
What the benchmark numbers mean
Adversarial training gives ~55-60% robust accuracy on CIFAR-10 (L-infinity ε = 8/255) against AutoAttack. Clean accuracy is typically 80-85% for the same model. The “robustness gap” — clean accuracy minus robust accuracy — is real and large.
The Pareto frontier matters. More adversarial training epochs and data augmentation increase robust accuracy but decrease clean accuracy. Any comparison of defenses needs to control for clean accuracy, or compare on the Pareto frontier.
ε = 8/255 on CIFAR-10 is not a small perturbation. At ε = 8/255 L-infinity, the perturbation is visible in most images. The benchmark exists because it’s a challenging but solvable optimization problem; it doesn’t represent a typical attacker with a stealthiness constraint. For physically realizable attacks, smaller ε values and different norms (L2, perceptual norms) are more realistic.
Transfer rates are lower than direct attack rates. An adversarial example crafted against model A succeeds against model B at 20-60% of the rate it would against model A (depending on architecture similarity). Building adversarial examples against a substitute model and using them against a black-box target is an effective tactic for attacks against production ML APIs where gradients aren’t available — the techniques for constructing effective substitute models are covered at aiattacks.dev ↗.
Practical implications for defenders
For teams deploying image classifiers in adversarial settings:
- Don’t use FGSM as your only evaluation. It’s cheap to run and cheap to fake robustness against. Run AutoAttack or PGD with restarts.
- Adversarial training is the most defensible empirical approach. At the cost of clean accuracy, it produces models that are genuinely harder to attack with L-infinity perturbations. The methodology and tradeoffs for deploying adversarially trained models in production are covered at aidefense.dev ↗.
- Know your threat model. L-infinity robustness doesn’t imply L2 robustness. If your adversary crafts localized perturbations rather than distributed ones, L2 evaluation is more relevant.
- The certified robustness alternative. If you need a provable guarantee, not just empirical resistance, see the randomized smoothing post on this site — certified methods trade some clean accuracy for an unconditional mathematical bound.
References
- Goodfellow et al., “Explaining and Harnessing Adversarial Examples” (2014), arXiv:1412.6572 ↗
- Madry et al., “Towards Deep Learning Models Resistant to Adversarial Attacks” (2018), arXiv:1706.06083 ↗
- Carlini and Wagner, “Towards Evaluating the Robustness of Neural Networks” (2017), arXiv:1608.04644 ↗
- Athalye et al., “Obfuscated Gradients Give a False Sense of Security” (2018), arXiv:1802.00420 ↗
- Croce and Hein, “Reliable Evaluation of Adversarial Robustness with an Ensemble of Diverse Parameter-Free Attacks” (2020), arXiv:2003.01690 ↗