Adversarial Patch Attacks: Physical Perturbations That Fool ML

Most discussions of adversarial examples focus on imperceptible perturbations: noise bounded in L_∞ or L_2 norm, invisible to human observers. Per-image gradient attacks and universal adversarial perturbations both operate in this regime. Adversarial patches break the imperceptibility assumption entirely. The perturbation is large, visible, and localized: a rectangular or irregularly shaped sticker that can be printed and physically placed in the real world.

The adversarial patch attack was introduced by Brown et al. in their 2017 paper “Adversarial Patch” (arXiv:1712.09665). The key insight: if you drop the imperceptibility constraint and allow a large localized perturbation, you can optimize for attacks that survive physical-world transformations such as rotation, scale changes, varying illumination, camera noise, and JPEG compression. A patch that works only in digital pipelines is fragile. A patch robust to all these transforms can be printed and used to fool real cameras.

What makes patches different

The threat model for adversarial patches differs from standard adversarial examples in several important ways:

No norm constraint. Standard attacks are constrained by ||δ||_p ≤ ε. Patch attacks instead constrain the perturbation to a spatial region (the patch footprint) and apply no norm bound within that region. The patch can use any pixel values.

Location and scale invariance. The patch must work when placed at arbitrary locations and orientations in the image. An adversary printing a sticker cannot guarantee precise placement. The optimization therefore averages over expected placements.

Physical-world robustness. A patch rendered on a printed surface encounters: additive lighting variations (ambient, directional), multiplicative lighting effects (specular reflection on glossy materials), perspective distortion as camera angle changes, motion blur, camera sensor noise, and lossy compression downstream. The patch must survive all of these.

Semantic visibility. The patch is clearly visible. Its cover is that it doesn’t look like the target object: it looks like a random colored rectangle or, in more sophisticated attacks, like an innocuous object (a sticker, a piece of art, a small sign) that wouldn’t trigger suspicion.
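The difference between the two threat models can be made concrete with a minimal NumPy sketch. The function names, the 32×32 image, the 10×10 footprint, and the ε = 8/255 budget are illustrative assumptions, not values taken from any of the cited papers.

```python
import numpy as np

def linf_perturb(image, delta, eps):
    """Norm-bounded threat model: every pixel may change, but by at most eps."""
    return np.clip(image + np.clip(delta, -eps, eps), 0.0, 1.0)

def patch_perturb(image, patch, top, left):
    """Patch threat model: only the footprint changes, to any valid pixel value."""
    out = image.copy()
    h, w = patch.shape[:2]
    out[top:top + h, left:left + w] = np.clip(patch, 0.0, 1.0)
    return out

rng = np.random.default_rng(0)
image = rng.uniform(0.2, 0.8, size=(32, 32, 3))  # pixel values in [0, 1]
eps = 8 / 255

noisy = linf_perturb(image, rng.normal(size=image.shape), eps)
patched = patch_perturb(image, rng.uniform(size=(10, 10, 3)), top=5, left=12)

linf_change = np.abs(noisy - image).max()          # bounded by eps everywhere
patch_changed = np.any(patched != image, axis=-1)  # True only inside the footprint
patch_change = np.abs(patched - image).max()       # not bounded by eps
```

The norm-bounded perturbation touches every pixel by a small amount, while the patch touches at most 100 pixels but can change them far beyond ε.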

Computing adversarial patches

Brown et al.’s optimization uses Expectation over Transformation (EOT), a technique developed by Athalye et al. to optimize perturbations robust to stochastic transforms (arXiv:1707.07397):

Objective: maximize E_{x~X, t~T, l~L}[log P(ŷ | A(p, x, l, t))]

Where:
  p = patch (the variable being optimized)
  x = background image sampled from dataset X
  t = transformation sampled from transformation distribution T
     (rotation, scale, translation, perspective)
  l = patch location sampled from location distribution L
  A(p, x, l, t) = image x with patch p transformed by t and then applied at location l
  ŷ = target class (for targeted attacks)
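As an illustration of the patch application operator A(p, x, l, t), here is a simplified sketch that assumes the transformation is applied to the patch before it is pasted into the image. The transformation is reduced to quarter-turn rotations and a brightness factor so that plain NumPy suffices; practical implementations use continuous affine warps with interpolation. The dictionary keys and function names are hypothetical.

```python
import numpy as np

def apply_patch(patch, image, loc, t):
    """Toy A(p, x, l, t): transform the patch by t, then paste it into x at l.

    t is a dict with two hypothetical keys:
      'quarter_turns': number of 90-degree rotations (stand-in for rotation)
      'brightness':    multiplicative lighting factor
    """
    p = np.rot90(patch, k=t["quarter_turns"], axes=(0, 1))
    p = np.clip(p * t["brightness"], 0.0, 1.0)
    top, left = loc
    h, w = p.shape[:2]
    out = image.copy()
    out[top:top + h, left:left + w] = p
    return out

def sample_transform_and_location(rng, image_shape, patch_shape):
    """Draw t ~ T and l ~ L so that the patch lands fully inside the image."""
    t = {"quarter_turns": int(rng.integers(0, 4)),
         "brightness": float(rng.uniform(0.7, 1.3))}
    ph, pw = patch_shape[:2]
    if t["quarter_turns"] % 2:  # odd quarter-turns swap height and width
        ph, pw = pw, ph
    top = int(rng.integers(0, image_shape[0] - ph + 1))
    left = int(rng.integers(0, image_shape[1] - pw + 1))
    return t, (top, left)

rng = np.random.default_rng(0)
x = np.zeros((64, 64, 3))        # background image
p = np.full((12, 20, 3), 0.5)    # patch
t, l = sample_transform_and_location(rng, x.shape, p.shape)
composited = apply_patch(p, x, l, t)
```

Sampling a fresh `t` and `l` for every background image is what turns a single patch into an expectation over placements and conditions.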

The optimization maximizes the expected log-probability of the target class over the joint distribution of backgrounds, transformations, and locations. Practically:

1. Sample a batch of background images.
2. For each image, sample a random transformation (rotation angle, scale factor, translation, brightness multiplier) and a patch location.
3. Apply the transformed patch to each image at its sampled location.
4. Run the classifier forward.
5. Compute the loss: the cross-entropy toward the target class, which equals the negated log-probability of the target.
6. Backpropagate gradients to the patch pixels.
7. Take a gradient step on the patch, then clip pixel values to the valid range [0, 255].
8. Repeat from step 1.

Because the transformations are differentiable (implemented via spatial transformer networks or bilinear interpolation), gradients flow through them. The result is a patch that, in expectation over physical-world conditions, causes the target classification.
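The loop described above can be sketched end to end against a toy model. This is not Brown et al.'s implementation: it assumes a linear softmax classifier so the gradient can be written by hand, reduces the transformation to a random location and brightness factor, ignores sensor saturation, and uses pixel values in [0, 1] rather than [0, 255]. The "dominant colour" classifier and all names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def paste(image, patch, top, left, brightness):
    """Toy A(p, x, l, t): scale the patch by a brightness factor, then paste it."""
    out = image.copy()
    ph, pw = patch.shape[:2]
    out[top:top + ph, left:left + pw] = brightness * patch  # no saturation modelled
    return out

def optimize_patch(W, backgrounds, target, patch_hw=(8, 8), steps=300, lr=5.0,
                   batch=8, seed=0):
    """Gradient ascent on E[log P(target)] for logits = W @ image.ravel()."""
    rng = np.random.default_rng(seed)
    H, Wd, C = backgrounds.shape[1:]
    ph, pw = patch_hw
    patch = rng.uniform(size=(ph, pw, C))
    onehot = np.eye(W.shape[0])[target]
    for _ in range(steps):
        grad = np.zeros_like(patch)
        for x in backgrounds[rng.choice(len(backgrounds), size=batch)]:
            top = int(rng.integers(0, H - ph + 1))    # l ~ L
            left = int(rng.integers(0, Wd - pw + 1))
            b = float(rng.uniform(0.7, 1.3))          # t ~ T (brightness only)
            probs = softmax(W @ paste(x, patch, top, left, b).ravel())
            # d log P(target) / d image = W^T (onehot - probs)
            g_img = (W.T @ (onehot - probs)).reshape(H, Wd, C)
            # chain rule through the paste: patch pixels enter scaled by b
            grad += b * g_img[top:top + ph, left:left + pw]
        # ascent step, then projection back onto the valid pixel range
        patch = np.clip(patch + lr * grad / batch, 0.0, 1.0)
    return patch

# Hypothetical 3-class classifier whose logits are 6 * (mean R, mean G, mean B).
H, Wd = 16, 16
W = np.tile(6.0 * np.eye(3), (1, H * Wd)) / (H * Wd)
backgrounds = np.random.default_rng(1).uniform(size=(32, H, Wd, 3))
patch = optimize_patch(W, backgrounds, target=0)  # target class 0 = "red"
```

Against this toy classifier the optimized patch drifts toward saturated red, the colour that raises the target logit regardless of where the patch lands. With a deep network the hand-written gradient would be replaced by automatic differentiation, but the structure of the loop is the same.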

Targeted vs. untargeted. Brown et al.’s original paper focused on targeted attacks: make the classifier predict a specific class (e.g., “toaster”) regardless of the actual image content. Untargeted variants exist but are less studied because causing any misclassification is less practically impactful than causing a specific misclassification.

Patch size. Fooling rates scale with patch size relative to the image. Brown et al. found patches covering roughly 10% of the image area sufficient for >90% targeted attack success on ImageNet classifiers. For object detectors, larger patches (covering the object of interest significantly) are typically required.
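To get a feel for what an area fraction means in pixels, a short calculation helps. The 224×224 input size is an assumption (a common ImageNet preprocessing choice), not a figure from the text above.

```python
import math

def square_patch_side(image_h, image_w, area_fraction):
    """Side length in pixels of a square patch covering area_fraction of the image."""
    return math.sqrt(area_fraction * image_h * image_w)

side = square_patch_side(224, 224, 0.10)  # roughly 70.8 pixels per side
```

A patch covering 10% of the area therefore spans about a third of the image width, which is small by area but far from subtle.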

Physical-world demonstrations

Hiding from object detectors

Thys et al. (arXiv:1904.08653, 2019) extended patch attacks to person detection, targeting YOLOv2. The attack optimizes a patch that, when printed on a t-shirt or held in front of the body, causes the person detector to fail to produce a bounding box around the person.

The threat model: an adversary who wants to evade camera-based surveillance or automated access control. The patch is worn on a t-shirt or carried as a sign. The person becomes “invisible” to automated detection while remaining fully visible to humans.

Their evaluation: in real camera footage with 4 different people walking at various angles and distances, a printed patch suppressed YOLOv2 person detections to near zero while human observers clearly saw the people. The patch (roughly torso-sized at 40×40 cm) is inconspicuous as clothing.

Stop sign manipulation

Eykholt et al. (arXiv:1707.08945, 2017) attacked traffic sign recognition: adding small stickers to a stop sign to cause an autonomous vehicle perception system to classify it as a 45 mph speed limit sign. The stickers are small enough to look like graffiti or damage to a casual observer.

The attack uses a different formulation: instead of a single large patch, it places multiple small patches in specific positions on the sign. The positions are chosen to be robust to different viewing angles and distances. In real-world tests recorded from a camera mounted on a car, the classifier produced the wrong label from distances up to 10 meters.

Face recognition bypass

Face recognition systems have been attacked with printed glasses frames designed as adversarial perturbations (Sharif et al., CCS 2016). The frames look plausibly like normal accessories while causing the face recognition system to identify the wearer as a different person.

The attack involves printing the glasses frames, wearing them, and being photographed, after which the system produces an incorrect identity. Depending on the adversary’s objective, the printed frames can cause the recognition system to match the adversary’s face to a chosen target identity (impersonation) or to produce no match at all (evasion/unlinking).

Adversarial textures

Patch attacks can be extended to adversarial textures applied to 3D object surfaces. A car painted with an adversarial texture is misclassified as a different object type (or no object) by LiDAR fusion and camera detection systems. This has been demonstrated on 3D-rendered simulations and in physical mock-ups, though real physical demonstrations on vehicles at automotive scale remain more limited due to the difficulty of manufacturing painted adversarial textures at centimeter resolution.

Attacks on object detectors vs. classifiers

Attacking object detectors requires disrupting the full detection pipeline: region proposals, per-region classification, and bounding box regression. This is more complex than attacking a classifier that produces a single label for a whole image.

Hiding objects. Optimizing for the detector to produce no bounding box around a target object. The patch suppresses objectness scores and classification scores simultaneously. The most effective patches cover a significant portion of the target object.

Generating false detections. Optimizing a patch that, when visible anywhere in the camera’s field of view (not necessarily on a real object), causes the detector to hallucinate bounding boxes around non-existent objects. This is a denial-of-service attack on downstream systems that consume detection outputs.

Misclassification. Causing a correctly detected object to be classified as the wrong class without suppressing the detection. Useful when the adversary wants to fool a downstream system that uses the class label (e.g., routing a vehicle to the wrong lane based on a misclassified sign).

DPatch (Liu et al., arXiv:1806.02299) is a generalized framework for patch attacks on object detectors that targets the anchor regression and classification heads directly.
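For the object-hiding goal, one simple objective is the highest detection score the detector assigns to the target class across all candidate boxes. The sketch below assumes a YOLO-style output (one objectness score and one class-probability vector per candidate box) and is not the exact loss used by DPatch or any other cited paper. The function names and the numbers are made up for illustration.

```python
import numpy as np

def hiding_loss(objectness, class_probs, target_class):
    """Highest objectness * class score for target_class over all candidate boxes.

    Minimising this pushes every candidate below the detection threshold.
    """
    scores = objectness * class_probs[:, target_class]
    return float(scores.max())

def surviving_detections(objectness, class_probs, target_class, threshold=0.5):
    """Number of candidate boxes that would still be reported for target_class."""
    scores = objectness * class_probs[:, target_class]
    return int((scores >= threshold).sum())

# Three candidate boxes, two classes (class 0 = "person"); illustrative values only.
class_probs = np.array([[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]])
clean_objectness = np.array([0.9, 0.2, 0.6])
patched_objectness = np.array([0.3, 0.1, 0.2])

loss_clean = hiding_loss(clean_objectness, class_probs, 0)        # 0.9 * 0.8
loss_patched = hiding_loss(patched_objectness, class_probs, 0)    # 0.3 * 0.8
kept_clean = surviving_detections(clean_objectness, class_probs, 0)
kept_patched = surviving_detections(patched_objectness, class_probs, 0)
```

Taking the maximum rather than the mean targets the strongest surviving candidate, since a single box above threshold is enough for the object to be detected.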

Defenses

Adversarial training for patches

Including patch-perturbed images during training improves model robustness to patches within the training distribution. The model can learn to ignore or downweight anomalous pixel regions that look like patches. Limitations: coverage of the patch design space is limited to what was seen in training, so a newly designed patch can still succeed if it exploits geometry the model didn’t see during training.

Local gradient masking

If a region of the image has anomalously high gradient magnitude toward a specific class, flag it as a potential adversarial patch. This detection is available to defenders with access to the classifier’s gradients, but not in production APIs that don’t expose gradients.
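A minimal sketch of this check, assuming a per-pixel saliency map (for example the absolute gradient of the predicted class score with respect to each pixel, summed over channels) has already been computed elsewhere. The window size and the synthetic saliency values are illustrative assumptions.

```python
import numpy as np

def most_salient_window(saliency, window):
    """Return (top, left, share) for the window x window region that holds the
    largest share of total saliency, computed with a summed-area table."""
    sat = np.pad(saliency, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    sums = (sat[window:, window:] - sat[:-window, window:]
            - sat[window:, :-window] + sat[:-window, :-window])
    top, left = np.unravel_index(np.argmax(sums), sums.shape)
    return int(top), int(left), float(sums[top, left] / saliency.sum())

# Synthetic saliency map: low everywhere except one 8x8 block.
saliency = np.full((32, 32), 0.01)
saliency[20:28, 10:18] = 1.0

top, left, share = most_salient_window(saliency, window=8)
suspicious = share > 0.5  # hypothetical flagging threshold
```

A natural image usually spreads saliency over the object; a small window holding most of the total is the anomaly this heuristic looks for. The threshold would need tuning against false positives on legitimately compact objects.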

Segment-and-mask

Detect anomalous image regions based on color distribution statistics or semantic segmentation, then mask or inpaint those regions before classification. The risk: false positives on legitimate occluded regions or semantically unusual but benign image content.
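The masking half of this defense can be sketched as follows, assuming some upstream detector has already produced a suspicious bounding box. Filling with the mean colour of the remaining pixels is one simple choice among many (inpainting is another); the function name and the example values are assumptions.

```python
import numpy as np

def mask_region(image, top, left, height, width):
    """Replace a flagged box with the per-channel mean of the unflagged pixels."""
    keep = np.ones(image.shape[:2], dtype=bool)
    keep[top:top + height, left:left + width] = False
    out = image.copy()
    out[~keep] = image[keep].mean(axis=0)
    return out

rng = np.random.default_rng(0)
image = rng.uniform(size=(32, 32, 3))
image[10:18, 10:18] = [1.0, 0.0, 1.0]  # garish stand-in for a patch

masked = mask_region(image, top=10, left=10, height=8, width=8)
```

Everything outside the box is left untouched, which is also why a false positive is costly: a masked region that covered a real object removes that object's evidence from the input.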

Certified defenses

Chiang et al. (arXiv:2003.06693) developed certified defenses against adversarial patches using interval bound propagation. The certification guarantees correct classification for any patch up to a specified maximum size, but at significant accuracy cost on clean images. The certified region is a worst-case bound, not a tight estimate.

Randomized ablation (Levine and Feizi, arXiv:2002.10733) provides certifiable robustness to L_0-bounded perturbations (which include patches): randomly mask out portions of the image and take a majority vote over the masked versions. This is analogous to randomized smoothing for L_2 perturbations.
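The voting mechanism can be illustrated with a toy example. This sketch shows only the ablate-and-vote step: it does not compute a robustness certificate, and it uses a hypothetical hand-written classifier instead of a network trained on ablated inputs. The NULL marker value, image size, and sample counts are assumptions.

```python
import numpy as np

def ablate(image, keep, rng, null_value=-1.0):
    """Keep `keep` randomly chosen pixels; overwrite the rest with a NULL marker."""
    h, w = image.shape[:2]
    mask = np.zeros(h * w, dtype=bool)
    mask[rng.choice(h * w, size=keep, replace=False)] = True
    mask = mask.reshape(h, w)
    out = np.full_like(image, null_value)
    out[mask] = image[mask]
    return out

def ablation_vote(classify, image, num_classes, keep=10, samples=200, seed=0):
    """Count the classes predicted on independently ablated copies of the image."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(num_classes, dtype=int)
    for _ in range(samples):
        votes[classify(ablate(image, keep, rng))] += 1
    return votes

def trigger_classifier(img):
    """Hypothetical brittle classifier: predicts class 1 if any pixel is near white."""
    return int((img > 0.9).any())

clean = np.full((32, 32), 0.2)     # grayscale image of class 0
patched = clean.copy()
patched[5:11, 5:11] = 1.0          # 6x6 patch that triggers class 1

direct_prediction = trigger_classifier(patched)           # fooled by the patch
votes = ablation_vote(trigger_classifier, patched, num_classes=2)
voted_prediction = int(votes.argmax())
```

Because the patch covers only 36 of 1024 pixels, most ablated copies that retain 10 pixels contain none of it, so the majority vote stays on the clean class even though the unprotected classifier is fooled.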

Detection via multi-view consistency

In systems with multiple cameras or temporal sequences (video), predictions can be cross-checked across adjacent frames and views. A patch tends to produce an extreme class confidence that holds only from some viewpoints, so a confident prediction that disagrees with neighboring frames or views is a useful warning sign. Natural scene variation is not perfectly consistent across views either, so the check provides evidence of an attack, not proof.
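A minimal sketch of a temporal consistency check, assuming the per-frame predicted labels are already available. The neighbourhood radius and the example sequence are illustrative assumptions, and a practical system would also use confidences and object tracking.

```python
import numpy as np

def inconsistent_frames(labels, radius=2):
    """Flag frames whose label differs from the majority label of the
    neighbouring frames within `radius` positions."""
    labels = np.asarray(labels)
    flags = np.zeros(len(labels), dtype=bool)
    for i in range(len(labels)):
        lo, hi = max(0, i - radius), min(len(labels), i + radius + 1)
        neighbours = np.concatenate([labels[lo:i], labels[i + 1:hi]])
        if neighbours.size == 0:
            continue  # a single-frame sequence has nothing to compare against
        values, counts = np.unique(neighbours, return_counts=True)
        flags[i] = labels[i] != values[np.argmax(counts)]
    return flags

# Class 3 throughout, except one frame where the prediction flips to class 7.
frame_labels = [3, 3, 3, 7, 3, 3, 3]
flags = inconsistent_frames(frame_labels)
```

The same function applies to simultaneous views from several cameras if the views are ordered so that neighbours observe the same object.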

Why patches remain an open problem

Patch attacks are a hard class of adversarial examples to defend against because:

No norm constraint = large budget. Defenses designed for small L_∞ perturbations provide no guarantees against patches that use arbitrary pixel values in their footprint.
Physical robustness transfers. EOT optimization produces patches that are robust to the same transforms defenders would apply for detection. A blurring or JPEG defense degrades patch attacks somewhat, but an adversary can compensate by optimizing under the blurring transform.
Detection is hard to generalize. A patch detector trained to detect patches with certain visual characteristics fails against patches designed to look like innocuous content (a logo, a label, decorative patterns).
Transfer across models. Patches computed for one detector often transfer to others, as per-image attacks do, enabling black-box attacks against deployed systems.

The most principled defense direction remains certified robustness — but current certified approaches against patches have accuracy costs that make them impractical for many real deployments.

Evaluating adversarial patch research

Claims in the adversarial patch literature should be scrutinized for:

Physical vs. digital evaluation. A patch that works in simulation or on digital images inserted into the dataset isn’t the same as one that works when printed and photographed. Check whether the evaluation involves real-world captures.
Detection bypass. Some evaluations report attack success rate without checking whether a separate anomaly detector would flag the patch. Achieving attack success while also evading detection is harder.
Lighting and angle conditions. Patches tested under a narrow range of lighting and angles will fail outside that range. Look for multi-condition testing.
Adaptive defenders. A patch attack that works against an off-the-shelf detector may fail against one that’s been adversarially trained on similar patches.

Key papers

Brown et al., “Adversarial Patch,” arXiv 2017 (arXiv:1712.09665)
Athalye et al., “Synthesizing Robust Adversarial Examples,” ICML 2018 (arXiv:1707.07397)
Eykholt et al., “Robust Physical-World Attacks on Deep Learning Visual Classification,” CVPR 2018 (arXiv:1707.08945)
Thys et al., “Fooling Automated Surveillance Cameras,” CVPRW 2019 (arXiv:1904.08653)
Chiang et al., “Certified Defenses for Adversarial Patches,” ICLR 2020 (arXiv:2003.06693)
Liu et al., “DPatch: An Adversarial Patch Attack on Object Detectors,” AAAI 2019 (arXiv:1806.02299)
Sharif et al., “Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition,” CCS 2016

Adversarial patches move the adversarial ML threat from a digital phenomenon to a physical one. The perturbations are visible, printable, and deployable against sensors operating in the real world. Understanding them is necessary context for any threat model involving computer-vision-based physical security, autonomous vehicle perception, or camera-based surveillance.

The same physical-world robustness problem appears at a different level when adversarial attacks target multimodal systems that process both images and text — the subject of adversarial attacks on vision-language models.