UAR: Measuring Neural Network Robustness Against Attacks You Haven't Seen Yet

The standard robustness benchmark is broken by design. Train a defense against PGD attacks, evaluate on PGD attacks, publish a number. That number tells you one thing: how well the model does against the exact threat it was tuned for. It says almost nothing about what happens when an adversary brings a different tool.

OpenAI’s work on UAR — Unforeseen Attack Robustness — directly targets this gap. The core question is not “can this classifier survive the attacks we know about?” but “does robustness transfer to attacks we haven’t anticipated?” The answer, for most contemporary defenses, is a qualified no.

The Evaluation Gap

Most adversarial robustness papers use a closed-world evaluation protocol. A model is trained with some defense — adversarial training, certified defenses, input preprocessing — and then tested against the same attack family or close variants. Adversarial training against L∞ PGD makes the model robust to L∞ PGD. That result often does not generalize to L2 attacks, L0 sparse attacks, unrestricted perturbations, or novel geometric transformations.
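To make the mismatch between threat models concrete, here is a minimal sketch in Python with NumPy. The image shape and the budget `eps = 8/255` are illustrative assumptions, not values from the source. It measures two perturbations under three norms: one at a corner of the L∞ ball, and one that changes a single value by the full pixel range.

```python
import numpy as np

def linf_norm(delta):
    """Largest absolute change to any single value."""
    return float(np.max(np.abs(delta)))

def l2_norm(delta):
    """Euclidean length of the perturbation."""
    return float(np.sqrt(np.sum(delta ** 2)))

def l0_norm(delta):
    """Number of values that were changed at all."""
    return int(np.count_nonzero(delta))

shape = (3, 32, 32)   # assumed image shape
eps = 8 / 255         # assumed L-inf budget
rng = np.random.default_rng(0)

# Every value moved by exactly +/- eps: a corner of the L-inf ball.
corner = eps * rng.choice([-1.0, 1.0], size=shape)

# One value moved by the full pixel range, everything else untouched.
sparse = np.zeros(shape)
sparse[0, 0, 0] = 1.0

for name, delta in [("corner", corner), ("sparse", sparse)]:
    print(name, linf_norm(delta), l2_norm(delta), l0_norm(delta))
```

The corner perturbation respects the L∞ budget but has an L2 norm of `eps * sqrt(d)`, where `d` is the number of values in the image. The sparse perturbation touches one value yet exceeds the L∞ budget many times over. Neither ball contains the other, which is why accuracy inside one says little about accuracy inside the next.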

This is not a minor technical complaint. It is the core failure mode when defenses meet production adversaries. An attacker who can read your defense paper now knows which attack family you hardened against — and can probe for the orthogonal ones you didn’t test.

The field has been aware of transfer gaps between attack families since at least Madry et al. (2018), whose PGD-trained models lost accuracy once the perturbation budget or norm moved away from the one used in training. What OpenAI’s UAR work adds is a systematic methodology for measuring how wide that gap is, across a diverse set of unforeseen attack types applied to a single fixed model.

What UAR Measures

UAR (Unforeseen Attack Robustness) is a summary metric over performance against a held-out population of attacks — attacks the model was not trained or tuned against. The evaluation protocol decouples the training threat model from the evaluation threat model.

The setup:

Train (or harden) a classifier under a fixed threat model — e.g., L∞ adversarial training.
Construct an evaluation suite of structurally distinct attack families: different Lp norms, JPEG-based perturbations, spatial transforms, color shifts, and other unforeseen perturbations.
For each unforeseen attack, compute classifier accuracy under attack (the complement of attack success rate), with that attack family withheld from the defender during training and tuning.
Aggregate those per-attack accuracies into UAR.

The aggregation matters. A single metric across diverse attacks penalizes the “train against one thing, break against another” failure mode. A model with high PGD robustness but near-zero robustness on spatial attacks will score poorly on UAR, even if it posts impressive L∞ numbers on standard benchmarks.
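The aggregation step can be sketched as follows. The text above says only that per-attack accuracies are combined. The normalization used here, dividing by the accuracy of a reference model adversarially trained against that same attack over several distortion sizes, follows our reading of the published UAR definition and should be treated as an assumption. The function names, the plain mean across attacks, and all numbers are hypothetical.

```python
def uar(model_acc, reference_acc):
    """Normalized score for one attack family (100 = matches the reference).

    model_acc: accuracies of the evaluated model at several distortion sizes.
    reference_acc: accuracies, at the same sizes, of a model adversarially
        trained against this attack (the assumed normalizer).
    """
    if not model_acc or len(model_acc) != len(reference_acc):
        raise ValueError("need matching, non-empty accuracy lists")
    return 100.0 * sum(model_acc) / sum(reference_acc)

def mean_uar(per_attack):
    """Per-attack scores plus their plain average (an assumed aggregate)."""
    scores = {name: uar(m, r) for name, (m, r) in per_attack.items()}
    return scores, sum(scores.values()) / len(scores)

# Hypothetical numbers: strong on the trained attack, weak on a held-out one.
results = {
    "linf_pgd": ([0.80, 0.60, 0.40], [0.80, 0.60, 0.40]),
    "spatial":  ([0.10, 0.05, 0.00], [0.75, 0.60, 0.45]),
}

scores, average = mean_uar(results)
for name, score in scores.items():
    print(f"{name}: {score:.1f}")
print(f"mean: {average:.1f}")
```

With these made-up inputs the model matches the reference on the attack it was trained for and falls far below it on the spatial attack, so the average lands well under the headline L∞ figure.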

The metric is intentionally model-centric — it evaluates a single frozen model, not a defense mechanism in general. This is the right framing for production deployment, where the question is whether a specific versioned model will hold up to novel attack attempts.

Why Current Defenses Fail Unforeseen Attacks

The mechanism is not mysterious. Adversarial training creates robustness by flattening the loss in the directions the training attack explores. An L∞ PGD-trained model has a smooth loss surface inside the L∞ ball it was trained on, but that smoothness is local. Outside the trained threat model, sharp gradients can re-emerge.

Certified defenses (randomized smoothing, interval bound propagation) provide guarantees within a specific norm ball, and those guarantees say nothing about inputs outside it. Randomized smoothing with Gaussian noise, for instance, certifies L2 robustness; it provides no formal guarantee against L0 sparse perturbations or spatial transformations that move pixels rather than perturbing their values.
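A small numeric sketch shows why a spatial transformation falls outside an L2 certificate. It uses NumPy, a random array as a stand-in image, and a hypothetical certified radius of 0.5; none of these come from the source. Random noise exaggerates the effect compared with smooth natural images, so read the result as qualitative.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in image, values in [0, 1)

# Shift the image one pixel along the width axis (with wrap-around).
shifted = np.roll(image, shift=1, axis=1)

# With no axis or ord given, norm() returns the 2-norm of the flattened array.
l2_distance = float(np.linalg.norm(shifted - image))

certified_radius = 0.5  # hypothetical L2 radius, for illustration only

print(f"L2 distance of a one-pixel shift: {l2_distance:.2f}")
print("inside certified ball:", l2_distance <= certified_radius)
```

The shift is barely visible to a person, yet measured in L2 it lands far from the original input, so a certificate stated as an L2 radius does not cover it.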

Input preprocessing defenses are arguably the most vulnerable class here. Defenses that denoise or quantize inputs before classification can be bypassed by attacks constructed against the preprocessing pipeline itself — a known failure mode for JPEG compression defenses and bit-depth reduction. An unforeseen attack that targets the denoiser rather than the classifier can still degrade performance without being caught by standard evaluation.

Implications for Defense Design

The UAR framing changes what “good robustness” looks like as a design target.

Under standard evaluation, optimizing adversarial training for the benchmark attack is rational. UAR removes that incentive by evaluating against attacks the designer cannot anticipate. This pushes toward defenses that are generically robust rather than attack-specific.

The most promising directions for UAR-aware design are:

Attack diversity in training. Training against a broader distribution of attack types (mixed Lp norms, spatial transforms, and corruption-based perturbations) tends to produce more transferable robustness than single-attack training. Work on corruption robustness benchmarks (Hendrycks & Dietterich) evaluates generalization across many distribution shifts at once, and the same logic of breadth applies to adversarial threat models.

Certified defenses with wider norm coverage. Smoothing-based approaches can, in principle, be extended to cover broader threat models, though at a cost to clean accuracy and certified radius. Research combining L2 smoothing with semantic similarity guarantees is one active direction here.

Auxiliary invariance training. Building in symmetry to transformations (rotation, scaling, color jitter) via augmentation or explicit architectural inductive biases provides some structural protection against spatial attack families without targeted adversarial training.

None of these is a solved problem. The honest position is that no current defense achieves high UAR across all unforeseen attack families simultaneously, and the tradeoffs between attack-specific hardening and general robustness remain poorly characterized.

Measurement as the Immediate Priority

The most actionable output of the UAR work is not a new training recipe — it is a call to change evaluation practice. Security engineers deploying ML classifiers in adversarial settings should:

Test against attacks outside the training threat model. If the model was hardened against L∞ perturbations, run L2, L1, and spatial attack baselines before treating it as production-hardened.
Report per-attack-family accuracy, not just aggregate numbers. Aggregate robustness scores hide the “99% robust to one attack, 0% to another” failure pattern that UAR is designed to surface.
Use held-out attack families in red-team evaluations. A red team that only probes the defense with the attacks disclosed in the model card is not finding the real risk surface. The threat intelligence value is in what the defender didn’t anticipate.
Treat robustness claims as threat-model-scoped. A claim of “adversarially robust” without specifying the attack family is not a useful security property. Match the attack family to the deployment threat model, and clearly document what’s out of scope.
Track UAR over model versions. If a new defense improves L∞ accuracy but drops performance on spatial or L0 attacks, that regression may not show up in standard benchmarks. UAR-style evaluation surfaces it.
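The last two practices, per-attack reporting and tracking across versions, can be sketched in a few lines of Python. The attack names, accuracies, and the `tolerance` threshold are hypothetical placeholders; a real pipeline would fill the dictionaries from its own evaluation runs.

```python
def find_regressions(previous, current, tolerance=0.02):
    """Return attack families whose accuracy dropped by more than `tolerance`.

    previous, current: dicts mapping attack-family name to accuracy under
    attack for two model versions. Families missing from `current` are skipped.
    """
    regressions = {}
    for attack, old_acc in previous.items():
        new_acc = current.get(attack)
        if new_acc is None:
            continue  # not evaluated on the new version
        drop = old_acc - new_acc
        if drop > tolerance:
            regressions[attack] = round(drop, 4)
    return regressions

# Hypothetical per-attack accuracies for two model versions.
v1 = {"linf_pgd": 0.52, "l2_pgd": 0.40, "spatial": 0.35, "l0_sparse": 0.30}
v2 = {"linf_pgd": 0.58, "l2_pgd": 0.41, "spatial": 0.12, "l0_sparse": 0.21}

for attack, drop in sorted(find_regressions(v1, v2).items()):
    print(f"REGRESSION {attack}: accuracy down by {drop:.2f}")
```

In this made-up comparison the new version improves on the L∞ attack it was tuned for while losing ground on the spatial and sparse attacks, which is exactly the pattern a single headline number would hide.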

The broader lesson is that adversarial ML evaluation needs the same threat-model rigor that security engineering applies elsewhere. A firewall rule that blocks one port while leaving adjacent ports open is not “partially secure”: it fails against any adversary who knows the gap. The same reasoning applies to attack-specific robustness claims.

Sources

Testing robustness against unforeseen adversaries (OpenAI): The primary source for UAR methodology and the motivation behind measuring robustness against a diverse held-out attack suite.
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations (Hendrycks & Dietterich): Introduced the ImageNet-C and ImageNet-P benchmarks for robustness to common corruptions and perturbations, a line of evaluation that parallels UAR’s focus on robustness beyond the training threat model.
Towards Deep Learning Models Resistant to Adversarial Attacks (Madry et al.): Introduced PGD-based adversarial training, the baseline that most robustness work is measured against, and implicitly set up the single-attack-family evaluation paradigm that UAR critiques.
