Evaluating Adversarial Robustness Without Fooling Yourself

There is a recurring pattern in adversarial ML that should make anyone reading a robustness claim cautious: a defense is published with a strong robust-accuracy number, and within a year, sometimes within weeks, someone breaks it with a stronger attack the original authors didn’t run. The defense’s idea wasn’t necessarily wrong. The evaluation was. Carlini et al.’s “On Evaluating Adversarial Robustness” (arXiv:1902.06705) opens with the blunt observation that most proposed defenses are quickly shown to be incorrect, and the failure is methodological. Understanding why is the difference between measuring robustness and fooling yourself.

The thing you are actually measuring

Robust accuracy is the fraction of test inputs the model classifies correctly under attack. The number is only meaningful relative to the attack used to compute it. “95% robust accuracy” is not a property of the model — it is a property of the model and the attack you ran. A weak attack produces a high robust-accuracy number against any model, robust or not. The entire game is whether your attack is strong enough that its failure to find an adversarial example is evidence the example doesn’t exist, rather than evidence your attack was bad.
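One way to write this down, as a sketch in notation that is not taken from the cited papers: let f be the classifier, A an attack that returns a point within the budget ε of its input, and n the number of test inputs. The first line is the number an evaluation reports; the second is the quantity it is trying to estimate.

```latex
\[
\mathrm{RA}_{A}(f;\varepsilon)
  = \frac{1}{n}\sum_{i=1}^{n}
    \mathbf{1}\!\left[\, f\!\left(A(x_i, y_i)\right) = y_i \,\right],
\qquad \lVert A(x_i, y_i) - x_i \rVert_p \le \varepsilon .
\]
\[
\mathrm{RA}^{\star}(f;\varepsilon)
  = \frac{1}{n}\sum_{i=1}^{n}
    \mathbf{1}\!\left[\, f(x_i + \delta) = y_i
      \ \text{for all } \lVert \delta \rVert_p \le \varepsilon \,\right]
  \;\le\; \mathrm{RA}_{A}(f;\varepsilon).
\]
```

The inequality holds for every attack A, because the point A returns is just one of the perturbations the second line quantifies over. A reported number is therefore an upper bound on true robust accuracy, and a weak attack makes that bound loose.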

This reframes the evaluation as a search problem with a one-sided error. If the attack finds an adversarial example, the model is definitively non-robust on that input. If the attack fails to find one, you’ve learned almost nothing unless you have reason to believe the attack would have found one if it existed. The whole methodology is about earning the right to interpret attack failure as robustness.
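A toy illustration of that one-sided error, as a minimal sketch: the linear classifier, the four data points, and both attack functions below are hypothetical, chosen only so the arithmetic is easy to follow. The same model scores very differently depending on which attack is run against it.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Attack = Callable[[Point, int, float], Point]

# Hypothetical toy model: a fixed linear classifier on 2-D inputs.
W: Point = (1.0, -2.0)


def predict(x: Point) -> int:
    """Return 1 if w.x > 0, else 0."""
    return 1 if W[0] * x[0] + W[1] * x[1] > 0 else 0


def weak_attack(x: Point, y: int, eps: float) -> Point:
    """Deliberately weak: perturbs only the first coordinate."""
    step = -eps if y == 1 else eps
    return (x[0] + step, x[1])


def strong_attack(x: Point, y: int, eps: float) -> Point:
    """Worst-case L-infinity attack for a linear model: move every
    coordinate by eps in the direction that pushes the score across zero."""
    direction = -1.0 if y == 1 else 1.0
    sign = lambda v: (v > 0) - (v < 0)
    return (x[0] + direction * eps * sign(W[0]),
            x[1] + direction * eps * sign(W[1]))


def robust_accuracy(data: List[Tuple[Point, int]], attack: Attack, eps: float) -> float:
    """Fraction of inputs still classified correctly after the attack."""
    survived = sum(
        1 for x, y in data
        if predict(x) == y and predict(attack(x, y, eps)) == y
    )
    return survived / len(data)


DATA = [((2.0, 0.0), 1), ((1.0, 0.0), 1), ((0.0, 1.0), 0), ((0.0, 0.4), 0)]

print(robust_accuracy(DATA, weak_attack, 0.5))    # 1.0
print(robust_accuracy(DATA, strong_attack, 0.5))  # 0.5
```

The weak attack reports perfect robustness at ε = 0.5; the worst-case attack shows that half the points can be flipped within the same budget. Nothing about the model changed between the two lines.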

Gradient masking: the trap that breaks evaluations

The single most common way evaluations go wrong is gradient masking, also called obfuscated gradients. Athalye et al. (arXiv:1802.00420) named and dissected it. A defense exhibits gradient masking when it makes the model’s gradients uninformative for the attacker without making the model genuinely robust. Gradient-based attacks then fail, robust accuracy looks high, and the defense appears to work, until an attack that doesn’t rely on the masked gradient walks right through it.

The paper identifies three mechanisms: shattered gradients (non-differentiable preprocessing that breaks backpropagation), stochastic gradients (randomness that makes a single gradient estimate noisy), and exploding or vanishing gradients (very deep or recurrent computation that destabilizes the gradient). Each makes the standard PGD attack underperform, and each is defeated by a tailored adaptive attack: Backward Pass Differentiable Approximation (BPDA) for shattered gradients, Expectation over Transformation (EOT) for stochastic ones, and reparameterization for exploding or vanishing ones.
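The shattered-gradient case can be sketched on a toy model. Everything below is an assumption made for illustration: the quantization “defense”, the linear head, and the finite-difference stand-in for backpropagation are hypothetical and far simpler than the defenses Athalye et al. studied. The point is only the mechanism: the forward pass keeps the real preprocessing, while the backward pass treats it as the identity.

```python
from typing import Callable, List

W = [1.0, 1.0]   # hypothetical linear head
B = -0.2
LEVELS = 4       # quantization grid with spacing 1/4


def quantize(x: List[float]) -> List[float]:
    """Non-differentiable preprocessing: snap to a grid (zero gradient almost everywhere)."""
    return [round(v * LEVELS) / LEVELS for v in x]


def score(x: List[float]) -> float:
    return sum(w * z for w, z in zip(W, quantize(x))) + B


def predict(x: List[float]) -> int:
    return 1 if score(x) > 0 else 0


def naive_grad(x: List[float], h: float = 1e-4) -> List[float]:
    """Central finite differences through quantize(): a stand-in for what
    plain backpropagation sees, which is a flat function."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += h
        lo[i] -= h
        grad.append((score(hi) - score(lo)) / (2 * h))
    return grad


def bpda_grad(x: List[float]) -> List[float]:
    """BPDA: backward pass pretends quantize() is the identity, so the
    gradient is the head's gradient. For a linear head that is just W."""
    return list(W)


def pgd_minimize(x0: List[float], grad_fn: Callable[[List[float]], List[float]],
                 eps: float, alpha: float = 0.05, steps: int = 10) -> List[float]:
    """Signed-gradient descent on the score, projected onto the L-infinity ball around x0."""
    sign = lambda v: (v > 0) - (v < 0)
    x = list(x0)
    for _ in range(steps):
        g = grad_fn(x)
        x = [xi - alpha * sign(gi) for xi, gi in zip(x, g)]
        x = [min(max(xi, c - eps), c + eps) for xi, c in zip(x, x0)]
    return x


x0 = [0.30, 0.30]
print(predict(pgd_minimize(x0, naive_grad, eps=0.2)))  # 1: attack never moves
print(predict(pgd_minimize(x0, bpda_grad, eps=0.2)))   # 0: label flipped
```

With the naive gradient the attack sees zeros and stays at the starting point, which an evaluator could misread as robustness. With the identity approximation on the backward pass, the same budget flips the label.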

Athalye et al. examined the non-certified white-box defenses accepted at ICLR 2018, found that seven of the nine relied on obfuscated gradients, and circumvented six of those completely and one partially. The lesson institutionalized by that paper: if your defense reduces attack success and you can’t explain why the attack failed in terms other than “the model is robust,” suspect gradient masking. There is a checklist of warning signs: single-step attacks outperforming iterative ones, black-box attacks outperforming white-box, unbounded attacks failing to reach 100% success, and increasing the perturbation budget not increasing attack success. Any of these is a sign that the gradient is lying to your attacker and that your robust-accuracy number is fiction.
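The warning signs can be turned into a mechanical check over numbers you have already measured. The sketch below is an assumption-laden illustration: the `AttackResults` container, its field names, and the 0.01 tolerance are hypothetical, and a clean result here does not prove the absence of masking.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AttackResults:
    """Attack success rates in [0, 1], all measured on the same model and data."""
    single_step: float             # e.g. FGSM
    iterative: float               # e.g. PGD
    black_box: float
    white_box: float
    unbounded: float               # attack with no perturbation limit
    by_budget: Dict[float, float]  # eps -> success rate of the same attack


def masking_warnings(r: AttackResults, tol: float = 0.01) -> List[str]:
    """Return the gradient-masking warning signs that the results trigger."""
    warnings = []
    if r.single_step > r.iterative + tol:
        warnings.append("single-step attack beats iterative attack")
    if r.black_box > r.white_box + tol:
        warnings.append("black-box attack beats white-box attack")
    if r.unbounded < 1.0 - tol:
        warnings.append("unbounded attack does not reach 100% success")
    budgets = sorted(r.by_budget)
    for small, large in zip(budgets, budgets[1:]):
        not_saturated = r.by_budget[small] < 1.0 - tol
        if not_saturated and r.by_budget[large] <= r.by_budget[small] + tol:
            warnings.append(f"success does not increase from eps={small:.4f} to eps={large:.4f}")
    return warnings


healthy = AttackResults(0.4, 0.9, 0.3, 0.9, 1.0, {2 / 255: 0.3, 8 / 255: 0.9, 32 / 255: 1.0})
masked = AttackResults(0.5, 0.2, 0.6, 0.2, 0.7, {8 / 255: 0.2, 32 / 255: 0.2})

print(masking_warnings(healthy))      # []
print(len(masking_warnings(masked)))  # 4
```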

Adaptive attacks: the non-negotiable

The core methodological requirement from Carlini et al. is the adaptive attack: an attack designed with full knowledge of the defense, specifically to defeat it. Evaluating a defense against attacks that don’t know the defense is in place tells you nothing about a real adversary, who will know. To assemble the right adaptive threat model before you measure anything, the Attack Selector maps the controls you have in place to the attacks that are designed to push past them.

This is harder than it sounds, because it requires the defender to genuinely try to break their own defense. A defense with a non-differentiable component must be evaluated with an attack that approximates or circumvents that component, not with a stock PGD that the component happens to block. A randomized defense must be evaluated with an attack that accounts for the randomization. The paper is explicit that this is intellectual work, not a button you press: you have to understand the mechanism of your defense and construct the attack that targets that mechanism.
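For the randomized case, “accounting for the randomization” usually means averaging the gradient over many draws of the defense’s randomness, the expectation-over-transformation idea. The sketch below is a toy under stated assumptions: the Gaussian input noise, the tanh score, and the weights are hypothetical stand-ins for a real randomized defense, and the gradient is written out analytically instead of using an autodiff library.

```python
import math
import random
from statistics import pvariance
from typing import List

W = [0.8, -0.5]  # hypothetical weights of a toy scoring function


def grad_once(x: List[float], rng: random.Random, sigma: float = 1.0) -> List[float]:
    """Gradient of tanh(w.(x + r)) with respect to x for ONE random draw r
    of the defense's input noise."""
    u = sum(w * (xi + rng.gauss(0.0, sigma)) for w, xi in zip(W, x))
    scale = 1.0 - math.tanh(u) ** 2
    return [scale * w for w in W]


def eot_grad(x: List[float], rng: random.Random,
             n_samples: int = 200, sigma: float = 1.0) -> List[float]:
    """Expectation over transformation: average the gradient over many draws."""
    total = [0.0] * len(x)
    for _ in range(n_samples):
        g = grad_once(x, rng, sigma)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n_samples for t in total]


rng = random.Random(0)
x = [0.3, -0.2]
single_estimates = [grad_once(x, rng)[0] for _ in range(50)]
eot_estimates = [eot_grad(x, rng)[0] for _ in range(50)]

# The averaged estimate fluctuates far less from run to run.
print(pvariance(single_estimates) > pvariance(eot_estimates))
```

An attack that steps along a single noisy draw can wander; stepping along the averaged gradient targets the expected behavior of the defense, which is what the attacker faces in practice.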

The recommendations read as a discipline: use the strongest available attacks, verify the attack is working by sanity-checking that it reaches near-100% success against an undefended baseline and against the defense at large perturbation budgets, report attack hyperparameters and the number of iterations, and release models so others can attempt to break them. The release point matters — the field’s correction mechanism is independent re-evaluation, and a defense that isn’t released can’t be independently checked.

AutoAttack: the parameter-free standard

Adaptive attacks require expertise and effort, which means they’re done inconsistently. AutoAttack (Croce and Hein, arXiv:2003.01690) addresses this by providing a strong, parameter-free, reliable ensemble that catches the common failure modes without per-defense hand-tuning. It combines four attacks:

APGD-CE — an auto-stepped PGD optimizing cross-entropy loss, removing the step-size tuning that often makes hand-run PGD underperform.
APGD-T — a targeted variant optimizing the difference-of-logits-ratio loss, which is more robust to gradient scaling and to defenses that exploit cross-entropy’s quirks.
FAB — a minimum-norm attack that finds the closest decision boundary, useful when loss-maximizing attacks stall.
Square Attack — a score-based black-box attack using random search, which needs no gradient at all and therefore catches gradient-masking defenses that defeat the three gradient-based components.
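The scaling point behind the difference-of-logits-ratio (DLR) loss can be shown in a few lines. This is a sketch of the untargeted form as I understand it from Croce and Hein, not the reference implementation; the targeted variant used by APGD-T has a different denominator, and the logits below are made up. It assumes at least three classes and that the largest and third-largest logits differ.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], y: int) -> float:
    """Standard softmax cross-entropy, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[y]


def dlr_loss(logits: Sequence[float], y: int) -> float:
    """Untargeted DLR: -(z_y - max_{i != y} z_i) / (z_top1 - z_top3).
    Positive exactly when the input is misclassified."""
    if len(logits) < 3:
        raise ValueError("DLR needs at least three classes")
    ordered = sorted(logits, reverse=True)
    best_other = max(z for i, z in enumerate(logits) if i != y)
    return -(logits[y] - best_other) / (ordered[0] - ordered[2])


logits = [2.0, 1.0, 0.5, -1.0]
scaled = [10.0 * z for z in logits]  # same predictions, rescaled logits

print(cross_entropy(logits, 0), cross_entropy(scaled, 0))  # differ by orders of magnitude
print(dlr_loss(logits, 0), dlr_loss(scaled, 0))            # equal up to float rounding
```

Rescaling the logits leaves every prediction unchanged but drives the cross-entropy, and with it the gradient signal, toward zero. The DLR value does not move, which is why it is harder for a defense to blunt by inflating logits.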

The ensemble property is the point: a model is counted robust on an input only if all four attacks fail. Because Square requires no gradient, AutoAttack has a built-in check against the most common evaluation trap. It is not a substitute for a genuine adaptive attack against a novel defense mechanism — Croce and Hein are clear that AutoAttack is a strong default, not a proof — but it is dramatically harder to fool than a single hand-run PGD, and it makes results comparable across papers.
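The per-input worst case is simple to state in code. The sketch below only aggregates outcomes that are assumed to have been computed elsewhere; the function name, the dictionary layout, and the example flags are hypothetical and it does not run any attack.

```python
from typing import Dict, List


def ensemble_robust_accuracy(clean_correct: List[bool],
                             attack_success: Dict[str, List[bool]]) -> float:
    """An input counts as robust only if the model classifies it correctly
    and no attack in the ensemble found an adversarial example for it."""
    n = len(clean_correct)
    robust = 0
    for i in range(n):
        broken = any(flags[i] for flags in attack_success.values())
        if clean_correct[i] and not broken:
            robust += 1
    return robust / n


clean_correct = [True, True, True, True, False]
outcomes = {
    "apgd-ce": [True, False, False, False, False],
    "apgd-t":  [False, False, False, False, False],
    "fab":     [False, False, False, False, False],
    "square":  [False, True, False, False, False],  # gradient-free attack finds one more
}

print(ensemble_robust_accuracy(clean_correct, {"apgd-ce": outcomes["apgd-ce"]}))  # 0.6
print(ensemble_robust_accuracy(clean_correct, outcomes))                          # 0.4
```

In this made-up example the gradient-free attack breaks an input that every gradient-based attack missed, so the ensemble number is lower than any single attack’s number. That gap is the signal to investigate.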

RobustBench: standardized, leaderboarded, re-evaluated

RobustBench (Croce et al., arXiv:2010.09670) operationalizes all of this into a standardized benchmark. It evaluates submitted models with AutoAttack under fixed threat models (L-infinity at ε = 8/255 and L2 at ε = 0.5 on CIFAR-10, among others), maintains a leaderboard of 120+ models, and, critically, provides a model zoo so the community can independently attack any entry. The standardization removes the degrees of freedom that let weak evaluations inflate numbers: everyone is attacked the same way, at the same budget, by the same strong ensemble.

What RobustBench reveals when you read it honestly: state-of-the-art robust accuracy on CIFAR-10 at L-infinity ε = 8/255 sat in the ballpark of the high-50s to mid-60s percent when the benchmark launched, and years of work have pushed the top entries only into roughly the low 70s, against clean accuracies in the 80s-90s. The robustness gap is real and stubborn. A reported number far above the leaderboard for the same threat model should be treated as a claim that hasn’t survived AutoAttack, not as a breakthrough, until it’s on the board.

A practical evaluation checklist

For anyone evaluating a defense before trusting it:

State the exact threat model — norm, budget, and access. A defense robust at L-infinity ε = 8/255 is making no claim about L2 or about larger budgets.
Run AutoAttack as the floor, not a single PGD. If your robust accuracy doesn’t hold up under AutoAttack, you don’t have a robust model.
Build an adaptive attack against any novel mechanism in the defense — non-differentiable layers, randomization, detection components. Stock attacks don’t probe these.
Run the gradient-masking sanity checks. Does a stronger attack help? Does a larger budget reach 100% success? Does black-box underperform white-box? If any answer is no, suspect that your gradient is masked.
Compare against RobustBench for the same threat model. A number that beats the leaderboard without an independent re-evaluation is a hypothesis, not a result.
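The first item on the list can be made concrete by writing the threat model down as data and checking every adversarial example against it. This is a minimal sketch with hypothetical names (`ThreatModel`, `within_budget`) and flat lists standing in for images; the small tolerance is an assumption to absorb floating-point rounding.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ThreatModel:
    norm: str    # "Linf" or "L2"
    eps: float   # perturbation budget
    access: str  # e.g. "white-box"


def perturbation_size(x: Sequence[float], x_adv: Sequence[float], norm: str) -> float:
    """Size of the perturbation x_adv - x under the given norm."""
    diffs = [a - b for a, b in zip(x_adv, x)]
    if norm == "Linf":
        return max(abs(d) for d in diffs)
    if norm == "L2":
        return math.sqrt(sum(d * d for d in diffs))
    raise ValueError(f"unsupported norm: {norm}")


def within_budget(x: Sequence[float], x_adv: Sequence[float],
                  tm: ThreatModel, tol: float = 1e-9) -> bool:
    """True if the adversarial example respects the stated threat model."""
    return perturbation_size(x, x_adv, tm.norm) <= tm.eps + tol


linf = ThreatModel(norm="Linf", eps=8 / 255, access="white-box")
l2 = ThreatModel(norm="L2", eps=0.5, access="white-box")

x = [0.5, 0.5, 0.5]
x_adv = [0.6, 0.5, 0.5]  # one coordinate moved by 0.1

print(within_budget(x, x_adv, linf))  # False: 0.1 exceeds 8/255
print(within_budget(x, x_adv, l2))    # True: L2 size 0.1 is under 0.5
```

The same perturbation is out of bounds for one threat model and well inside another, which is why a robustness number quoted without its norm and budget cannot be compared with anything.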

Standardized cross-method robustness results that make these comparisons legible across both empirical and certified defenses are tracked at aisecbench.com. The attack mechanics underlying the evaluation (FGSM, PGD, and C&W) are covered in the evasion attacks post on this site, and the black-box transfer setting that motivates the Square component is detailed at aiattacks.dev. For deploying empirical defenses in production once they’ve actually been measured, see aidefense.dev.

The bottom line

A robustness number is a claim about an attack as much as about a model. Most broken defenses weren’t bad ideas — they were under-attacked, usually by an evaluation that gradient masking quietly defeated. The discipline that prevents self-deception is fixed: precise threat models, AutoAttack as the floor, genuine adaptive attacks against novel mechanisms, the gradient-masking sanity checks, and independent re-evaluation through RobustBench. Skip any of these and the number you report is measuring your evaluation’s weakness, not your model’s strength.

References

Carlini et al., “On Evaluating Adversarial Robustness” (2019), arXiv:1902.06705
Athalye et al., “Obfuscated Gradients Give a False Sense of Security” (2018), arXiv:1802.00420
Croce and Hein, “Reliable Evaluation of Adversarial Robustness with an Ensemble of Diverse Parameter-Free Attacks” (2020), arXiv:2003.01690
Croce et al., “RobustBench: a standardized adversarial robustness benchmark” (2020), arXiv:2010.09670
