All posts
-
UAR: Measuring Neural Network Robustness Against Attacks You Haven't Seen Yet
OpenAI's Unforeseen Attack Robustness metric quantifies how well a classifier holds up against adversarial perturbations outside its training distribution
-
Embedding Inversion: Reconstructing Text From Vectors
Embedding inversion recovers the original text from a model's embedding vectors, breaking the assumption that embeddings are an opaque, privacy-safe
-
Adversarial Training Methods: PGD-AT, TRADES, and MART
Adversarial training is the most defensible empirical robustness method, but 'adversarial training' isn't one thing.
-
Evaluating Adversarial Robustness Without Fooling Yourself
Most defenses that claim robustness are later broken — not because the idea was bad, but because the evaluation was.
-
Adversarial Examples vs. Data Poisoning: Timing Is Everything
Adversarial examples attack a deployed model at inference; data poisoning attacks the model before it is deployed.
-
Membership Inference vs. Model Inversion: Privacy Attacks
Membership inference asks 'was this sample in the training set?' Model inversion asks 'what samples were in the training set?'
-
Adversarial Attacks on Vision-Language Models: CLIP, LLaVA, GPT-4
Vision-language models expand the adversarial attack surface beyond image classifiers: adversarial images can manipulate text outputs, carry visual
-
Adversarial Patch Attacks: Physical Perturbations That Fool ML
Adversarial patches are large, visible, localized perturbations designed to survive physical-world conditions — printing, lighting, and camera optics.
-
Universal Adversarial Perturbations: One Vector That Fools Inputs
Unlike per-image attacks, universal adversarial perturbations are input-agnostic: a single crafted noise vector causes misclassification across virtually
-
Adversarial Robustness in NLP: Why Text Attacks Are Different
Discrete input spaces, semantic constraints, and human-perceptibility rules change what counts as an adversarial example in text.
-
Data Poisoning and Backdoor Attacks on Foundation Models
Training data manipulation, backdoor triggers, and Trojan attacks against large-scale models. What the threat model actually requires and where the
-
Evasion Attacks on Image Classifiers: FGSM, PGD, and C&W
The three foundational gradient-based evasion attacks, what each one actually optimizes, and what the benchmark numbers mean when you're evaluating a defense.
-
Model Inversion Attacks: Reconstructing Training Data from Output
From Fredrikson's pharmacogenetics exploit to Geiping's gradient inversion, model inversion attacks recover private training data in ways most ML
-
Adversarial Transferability: Why Black-Box Attacks Work at All
Adversarial examples transfer across models with different architectures and training sets. Understanding why changes what you think defenses need to
-
Certified Robustness via Randomized Smoothing: What It Guarantees
Randomized smoothing gives you a provable robustness radius. Understanding what that certificate means in practice — and where it breaks — is more useful
-
Training Data Extraction from LLMs: The Carlini Results Explained
Carlini et al. demonstrated verbatim extraction of training data from GPT-2. The results have been widely misread.
-
Membership Inference Attacks: What Works on Production ML APIs
Shokri et al.'s shadow-model attack is the canonical reference, but the gap between the paper's threat model and a real rate-limited API is wide.
-
GCG-Class Adversarial Suffix Attacks: A 2026 Practitioner Primer
The math, the cost curve, and why optimization-based attacks are now within reach of solo practitioners. With reproducible setup and what defenders
-
Model Extraction via Query-Based Functional Stealing
Query-based model stealing attacks can recover a functionally equivalent model from API access alone. The economics matter more than the technique: here's