Certified Robustness via Randomized Smoothing: What 'Certified' Actually Guarantees
Randomized smoothing gives you a provable robustness radius. Understanding what that certificate means in practice — and where it breaks — is more useful than the headline number.
“Certified robustness” sounds like a guarantee. For randomized smoothing, it is a guarantee — but the guarantee has a specific shape that is frequently misread as broader than it is. If you’re evaluating a defense that claims certified robustness, understanding what the certificate covers and what it doesn’t is the only way to assess whether it matters for your threat model.
The setup
Standard adversarial robustness training aims to produce a model that correctly classifies inputs even when they’ve been perturbed by an adversary. Empirical defenses are evaluated by checking whether known attacks can find adversarial examples. The weakness: “no known attack found one” is not a proof.
Certified robustness takes a different approach: given an input x and a classifier f, prove that no perturbation within some ball B(x, r) can change the classification. The certificate is a mathematical proof, not an empirical observation. If the certificate says radius r = 0.5 in L2 norm, then no adversary can change the prediction by perturbing x within that ball — regardless of the attack algorithm.
Randomized smoothing (Cohen et al., 2019, arXiv:1902.02918 ↗) is the currently dominant approach to constructing such certificates for neural networks at scale.
How randomized smoothing works
The key idea is to smooth the classifier by averaging its predictions over Gaussian noise. Define the smoothed classifier g as:
g(x) = argmax_c P(f(x + ε) = c) where ε ~ N(0, σ²I)
The smoothed classifier returns whichever class the base classifier predicts most often when x is perturbed by Gaussian noise with standard deviation σ.
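To make the definition concrete, here is a minimal Monte Carlo sketch in Python. The `base_classifier` callable is a stand-in for your own model (assumed to map a batch of inputs to integer class labels); the true g is an expectation over the noise, so any finite sample only approximates it.

```python
import numpy as np

def smoothed_predict(base_classifier, x, sigma, n_samples=1000):
    """Approximate g(x): majority vote of the base classifier over
    Gaussian-noised copies of x. `base_classifier` is a hypothetical
    callable mapping a batch of inputs to integer class labels."""
    noise = np.random.normal(0.0, sigma, size=(n_samples,) + x.shape)
    labels = base_classifier(x[None, ...] + noise)  # shape: (n_samples,)
    return np.bincount(labels).argmax()
```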
Cohen et al. proved the following: let c_A be the class the base classifier returns most often under the noise, so that p_A = P(f(x + ε) = c_A). Then g(x) = c_A, and g is certifiably robust around x with radius:
r = (σ/2) * (Φ⁻¹(p_A) - Φ⁻¹(p_B))
where p_B is the probability of the runner-up class (the highest P(f(x + ε) = c) over c ≠ c_A), and Φ⁻¹ is the inverse of the standard normal CDF.
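The formula is a one-liner with scipy (norm.ppf is Φ⁻¹); the numbers in this worked example are illustrative, not from any benchmark:

```python
from scipy.stats import norm

def certified_radius(p_a, p_b, sigma):
    """Certified L2 radius from the Cohen et al. bound."""
    return (sigma / 2.0) * (norm.ppf(p_a) - norm.ppf(p_b))

# Illustrative values: sigma = 0.5, p_A = 0.9, p_B = 0.1.
# Phi^-1(0.9) = 1.2816 and Phi^-1(0.1) = -1.2816, so r = 0.25 * 2.5631 = 0.64.
print(certified_radius(0.9, 0.1, 0.5))  # ~0.64
```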
In practice, you estimate p_A by sampling: run the base classifier on N noisy copies of x, count how many times each class wins, and use a one-sided confidence interval to get a lower bound on p_A. If the lower bound on p_A is high enough to certify a nonzero radius with high confidence, you have a certified prediction.
Abstaining is always allowed: if the certification confidence is too low, the procedure abstains rather than making an uncertified prediction. Certified accuracy at radius r is therefore reported as the fraction of test inputs that are both correctly classified and certified with radius at least r.
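Put together, the procedure looks roughly like the sketch below, a simplified version of Cohen et al.'s CERTIFY that reuses the hypothetical base_classifier from above. It uses the standard simplification of bounding p_B ≤ 1 − p_A, which collapses the radius formula to σ · Φ⁻¹(p_A_lower).

```python
import numpy as np
from scipy.stats import beta, norm

def certify(base_classifier, x, sigma, n0=100, n=10_000, alpha=0.001):
    """Simplified sketch of certification via sampling.
    Returns (predicted_class, certified_radius), or (None, 0.0) on abstain."""
    def sample_counts(num):
        noise = np.random.normal(0.0, sigma, size=(num,) + x.shape)
        return np.bincount(base_classifier(x[None, ...] + noise))

    # Guess the top class c_A from a small independent sample.
    c_a = sample_counts(n0).argmax()

    # Lower-bound p_A with a one-sided Clopper-Pearson interval at
    # confidence 1 - alpha, using a larger sample.
    counts = sample_counts(n)
    k = int(counts[c_a]) if c_a < len(counts) else 0
    p_a_lower = beta.ppf(alpha, k, n - k + 1) if k > 0 else 0.0

    # With p_B bounded by 1 - p_A, the radius simplifies to
    # sigma * Phi^-1(p_A_lower); abstain unless p_A > 1/2 is certified.
    if p_a_lower > 0.5:
        return c_a, sigma * norm.ppf(p_a_lower)
    return None, 0.0
```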
What the certificate actually proves
The certificate for a given (x, r) pair states: under L2 perturbations of magnitude at most r, the smoothed classifier g will return the same class c_A. This proof holds with probability at least 1 − α over the sampling randomness (commonly α = 0.001).
This is a meaningful guarantee. Empirical attacks that find adversarial examples within the certified radius are impossible by construction — if an attack claims to have found one, it made an error or the certificate was mis-applied.
What the certificate does not prove:
Certifications are point-wise. The radius r is valid for this specific input x. A nearby input x' may have a smaller or zero certified radius. A certified defense is not uniformly robust over the input space.
The base classifier can still be fooled outside the radius. The certificate is void for perturbations larger than r. For many realistic inputs, the certified radius is small enough that practical attacks can exceed it.
Smoothing changes the classifier. The smoothed classifier g is not the same as the base classifier f. Smoothing often reduces clean accuracy relative to the base classifier. The certified accuracy metric masks this: a correctly classified and certified point passes, but the smoothed classifier may misclassify inputs the base classifier gets right.
L2 vs. L-infinity
The Cohen et al. certificate is specifically for L2 perturbations. This matters enormously for practice.
Adversarial attacks are frequently described in terms of the L-infinity norm: the maximum per-pixel perturbation. An L2 certificate of radius r covers every L-infinity attack with budget epsilon only if r ≥ d^0.5 * epsilon, where d is the input dimension, because the worst case is a perturbation that pushes every pixel to the full budget.
For a 224x224x3 image (ImageNet), that worst case at epsilon = 8/255 (a standard evaluation setting) has L2 norm (224 * 224 * 3)^0.5 * (8/255) ≈ 12.2. State-of-the-art randomized smoothing on ImageNet achieves certified L2 radii of roughly 0.5-1.0, at the certified accuracies discussed below. An L-infinity attack at epsilon = 8/255 can therefore land more than an order of magnitude outside the certified radius.
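The worst-case conversion is easy to check yourself:

```python
import numpy as np

# Worst-case L2 norm of an L-infinity-bounded perturbation:
# ||delta||_2 <= sqrt(d) * ||delta||_inf, with equality when every
# pixel is pushed to the full per-pixel budget.
d = 224 * 224 * 3  # ImageNet input dimension
for eps in (8 / 255, 4 / 255):
    print(f"eps = {eps:.4f} -> max L2 norm = {np.sqrt(d) * eps:.2f}")
# eps = 0.0314 -> max L2 norm = 12.17
# eps = 0.0157 -> max L2 norm = 6.09
```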
There is work on extending randomized smoothing beyond L2, both by choosing different noise distributions (Yang et al., arXiv:2002.08118 ↗) and by tightening the certificate itself (Levine and Feizi, arXiv:2003.02460 ↗), but the resulting L-infinity certified radii are much smaller than the L2 equivalents at the same accuracy.
Where certified robustness breaks at realistic epsilon
The practical gap is wide. On ImageNet with σ = 0.5:
- Clean accuracy of the smoothed classifier: ~67% (vs. ~79% for an unsmoothed ResNet-50)
- Certified accuracy at L2 radius 0.5: ~49%
- Certified accuracy at L2 radius 1.0: ~37%
An L-infinity attack with epsilon = 4/255 can have L2 norm up to ~6.1 on ImageNet images, which is beyond the radius where certified accuracy is meaningful.
For CIFAR-10 the numbers are better in relative terms, but the gap persists. The honest takeaway: certified accuracy at practically relevant perturbation sizes is substantially lower than empirical accuracy under known attacks, and lower than clean accuracy.
This is not a flaw in randomized smoothing as a method — it’s an accurate picture of what is provably achievable. The flaw is in presenting certified robustness as a near-complete defense.
The practical gap vs. empirical defenses
Empirical defenses (adversarial training, input preprocessing) often report higher robustness under specific attacks at specific epsilon values than randomized smoothing’s certified accuracy at the same epsilon. The comparison is unfair: empirical robustness numbers can be broken by new adaptive attacks; certified robustness numbers cannot.
Athalye et al.’s “Obfuscated Gradients” paper (arXiv:1802.00420 ↗) showed that many pre-2018 defenses that reported strong empirical robustness were broken by adaptive attacks that the original evaluations didn’t consider. Certified robustness is immune to this problem by construction.
The tradeoff is real: you get a weaker guarantee (a smaller certified radius), but the guarantee is unconditional. For use cases where the adversary’s perturbation budget is genuinely small (L2 radius < 0.5), certified robustness via smoothing is the most defensible approach. For use cases where the relevant threat involves larger perturbations, neither certified nor empirical defenses hold up at production accuracy.
What “certified” is useful for in practice
The certificate is most useful in three scenarios:
Regulatory or contractual requirements for provable guarantees. If a downstream contract or regulation requires a verifiable robustness claim, certified methods are the only defensible choice. An empirical claim can be challenged by demonstrating an attack; a certified claim cannot.
Benchmarking and comparisons. Certified accuracy provides an apples-to-apples comparison metric that doesn’t depend on the choice of attack algorithm. It’s a cleaner comparison basis than “accuracy under PGD with 10 steps vs. 100 steps.”
Identifying inputs with no robustness guarantee. Randomized smoothing identifies inputs where the smoothed classifier abstains or has a very small certified radius. These are the inputs most at risk. Using the certified radius as a risk signal for individual predictions is a practical application.
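As a sketch of that pattern, reusing the certify() function from earlier: the threshold and the routing policy below are application-specific assumptions, not part of the method.

```python
RISK_THRESHOLD = 0.25  # minimum acceptable certified L2 radius (illustrative)

def classify_with_risk(base_classifier, x, sigma):
    """Route each prediction by its certified radius."""
    label, radius = certify(base_classifier, x, sigma)
    if label is None:
        return None, "abstained: no certified prediction at this confidence"
    if radius < RISK_THRESHOLD:
        return label, f"flagged: certified radius {radius:.3f} below threshold"
    return label, f"accepted: certified radius {radius:.3f}"
```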
References
- Cohen et al., “Certified Adversarial Robustness via Randomized Smoothing” (2019), arXiv:1902.02918 ↗
- Levine and Feizi, “Tight Second-Order Certificates for Randomized Smoothing” (2020), arXiv:2003.02460 ↗
- Athalye et al., “Obfuscated Gradients Give a False Sense of Security” (2018), arXiv:1802.00420 ↗
- Carlini et al., “Certified Defenses Are Not Attacking Right” (2022), arXiv:2202.07696 ↗
- Yang et al., “Randomized Smoothing of All Shapes and Sizes” (2020), arXiv:2002.08118 ↗