Adversarial Robustness in NLP: Why Text Attacks Are Different

Adversarial examples in computer vision exploit the gap between how pixels map to human perception and how they map to model predictions. The perturbation is small in L-infinity or L2 norm; humans can’t see it; the model is fooled. This formalization doesn’t transfer cleanly to text, and that non-transfer has significant consequences for what attacks and defenses look like.

The discreteness problem

The L-infinity norm ball formalization works for images because pixel values are continuous and small perturbations are semantically invisible to humans. Text is discrete: characters and tokens take values from a finite vocabulary. There’s no meaningful notion of “add epsilon to the word ‘dog’.”

This forces a different definition of admissible perturbation. For text, the analogue of “imperceptible” is something like:

Character-level: typos, homoglyphs, insertion/deletion of spaces or punctuation that a human reader would parse as unchanged
Word-level: synonym substitution, paraphrase, dropping or inserting uninformative words
Sentence-level: paraphrase that preserves meaning but changes surface form

Each of these preserves human interpretation while potentially changing model predictions. But none of them has the clean mathematical structure of an L-p ball. Evaluating whether a text perturbation is “semantically equivalent” requires a judgment call that’s often proxied by automated metrics (BERTScore, BLEURT, sentence embedding distance) that are themselves imperfect.

This creates an evaluation problem: the community has no consensus on what the right imperceptibility constraint is for text, which makes comparing attack and defense results harder than in vision.

Word-substitution attacks

The most studied class of text adversarial attacks substitutes individual words with semantically similar alternatives. The attack iterates over words in the input, identifies which word’s substitution has the highest impact on model loss, and replaces it with a synonym or semantically similar word that doesn’t change the human-readable meaning.

TextFooler (Jin et al., 2020, arXiv:1907.11932 ↗) is the canonical reference for word-substitution attacks on sentiment classifiers and natural language inference models. The algorithm:

Rank words by their importance to the prediction (measured by prediction change when each word is deleted).
For the most important word, find substitution candidates using a word embedding nearest-neighbor search (counter-fitted GloVe embeddings).
Filter candidates by part-of-speech match and semantic similarity (USE cosine similarity above a threshold).
Select the candidate that maximally reduces target-class confidence.
Repeat until the prediction flips or the perturbation budget is exhausted.

TextFooler achieved attack success rates of 87-97% on BERT, RoBERTa, and XLNET models on SST-2 sentiment analysis and SNLI textual entailment benchmarks, with an average of 20 word substitutions per example. On longer texts, this corresponds to modifying a small fraction of words.

BERT-Attack (Li et al., 2020, arXiv:2004.09984 ↗) improved on TextFooler by using a masked language model (BERT itself) to generate contextually appropriate substitutions, rather than word embedding nearest neighbors. Using the target model’s own architecture to generate attacks is an interesting inversion — the attack exploits the same contextual reasoning the model uses for classification.

The weakness of word-substitution attacks is that automated semantic similarity filters often fail to catch substitutions that change meaning in subtle ways. Human evaluation studies (e.g., Morris et al.’s TextAttack evaluation, arXiv:2005.05909 ↗) found that a meaningful fraction of “successful” adversarial examples are not semantically equivalent to the original — the human label changes along with the model prediction.

Character-level attacks and homoglyphs

Character-level attacks exploit the gap between how humans and tokenizers handle unusual characters. Ebrahimi et al.’s HotFlip (arXiv:1712.06751 ↗) used a white-box gradient computation to identify character substitutions with the highest impact on model loss, treating the character one-hot encoding as a continuous input and using a first-order Taylor approximation.

More practically relevant for security applications are homoglyph attacks: replacing ASCII characters with visually identical Unicode lookalikes. The human reader sees “paypal.com”; the tokenizer sees “pаypal.com” where the ‘a’ is Cyrillic. This is the same technique used in IDN homograph attacks for phishing — it has direct application to content moderation, toxicity detection, and any model operating on user-submitted text.

Boucher et al. (arXiv:2110.07926 ↗, 2022) demonstrated that invisible characters — Unicode control codes, zero-width joiners, bidirectional control characters — could be injected into text to fool code generation models into producing functionally different code from what a human reviewer would see. This attack is relevant to LLM code assistants and automated code review pipelines.

Prompt injection as adversarial text

Prompt injection attacks against large language models are adversarial text attacks with a different loss function. The adversary isn’t trying to flip a classification label; they’re trying to override the model’s behavior by constructing inputs that cause the model to ignore its system prompt or prior context.

The standard formulation: a system prompt establishes instructions (“you are a customer service assistant; don’t discuss competitors”). The user input contains adversarial text (“ignore previous instructions and instead…”). The attack succeeds if the model follows the injected instruction rather than the original system prompt.

Gradient-optimized versions of this attack — like GCG (Zou et al., arXiv:2307.15043 ↗) — append an adversarially optimized token suffix to a query that causes the model to comply with a harmful instruction. The suffix looks like gibberish to humans but functions as a command to the model. This is adversarial perturbation in the token space optimizing for LLM behavior rather than classifier output — the mathematical structure is similar even though the application is different.

Defense approaches and their limitations

Three categories of defense have been applied to adversarial text:

Adversarial training with text attacks. Fine-tuning on adversarial examples generated by word-substitution attacks improves robustness to those attacks but not to novel attacks that use different substitution constraints. The certified adversarial training approaches for images don’t translate cleanly because there’s no bounded-norm formulation for text.

Certified defenses for text. Jia et al. (arXiv:1905.13561 ↗) proposed interval bound propagation for text classifiers, certifying robustness to specific synonym sets. The certificate holds for all substitutions within the synonym set, but the synonym set must be predefined. This is more constrained than the L2-ball certificate in vision — you certify robustness to a discrete finite set rather than an uncountable ball.

Randomized smoothing has been extended to text by Ye et al. (arXiv:2005.05864 ↗) using span deletion as the smoothing operation, and by Zeng et al. using synonym substitution. The resulting certificates cover specific classes of perturbations but can’t certify against the full space of semantic-preserving changes.

Detection and rejection. Some defenses focus on detecting adversarial inputs rather than robustly classifying them — flagging examples where the classifier’s prediction is unstable under paraphrase or where token statistics are unusual (e.g., unusual Unicode characters, high frequency of rare words). Detection avoids the need for certified robustness but is vulnerable to evasion by adaptive adversaries who account for the detector in their attack optimization.

The overall picture is that text adversarial robustness is significantly less mature than image adversarial robustness. There’s no equivalent of RobustBench for NLP — no standardized benchmark with agreed-upon threat models and adaptive evaluation. Benchmark coverage for NLP adversarial evaluations is sparse; the state of standardized evaluation is tracked at aisecbench.com ↗.

Why the evaluation problem matters more than in vision

In vision, the L-p norm ball provides a clean model of “imperceptible perturbation” that most researchers accept as a useful approximation. Evaluation is automated and reproducible.

In NLP, the lack of a consensus constraint means that published attack success rates can reflect constraint violations — the adversarial example changes the human-readable label — rather than genuine robustness failures. A model that gets 90% attacked on TextFooler may actually have wrong outputs on only 60% of the “adversarial” examples if the other 30% changed meaning enough that the attack label is itself wrong.

This makes building production NLP robustness pipelines harder. Defenses that look strong on standard benchmarks may not be solving the right problem. The honest answer is to do human evaluation of adversarial example validity, use multiple attack methods with different perturbation types, and be explicit about which threat model — content-preserving word substitution, character-level manipulation, prompt injection — you’re actually defending against.

Engineering guidance for deploying robustness evaluation pipelines against NLP models, including tooling comparisons for TextAttack, OpenAttack, and LLM-specific evaluation suites, is at aidefense.dev ↗.

References

Jin et al., “Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment” (2020), arXiv:1907.11932 ↗
Li et al., “BERT-ATTACK: Adversarial Attack Against BERT Using BERT” (2020), arXiv:2004.09984 ↗
Morris et al., “TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP” (2020), arXiv:2005.05909 ↗
Ebrahimi et al., “HotFlip: White-Box Adversarial Examples for Text Classification” (2018), arXiv:1712.06751 ↗
Boucher et al., “Bad Characters: Imperceptible NLP Attacks” (2022), arXiv:2110.07926 ↗
Zou et al., “Universal and Transferable Adversarial Attacks on Aligned Language Models” (2023), arXiv:2307.15043 ↗
Jia et al., “Certified Robustness to Adversarial Word Substitutions” (2019), arXiv:1905.13561 ↗
Ye et al., “SAFER: A Structure-free Approach for Certified Robustness to Adversarial Word Substitutions” (2020), arXiv:2005.05864 ↗

Adversarial Robustness in NLP: Why Text Attacks Are Different

The discreteness problem

Word-substitution attacks

Character-level attacks and homoglyphs

Prompt injection as adversarial text

Defense approaches and their limitations

Why the evaluation problem matters more than in vision

References

Adversarial ML — in your inbox

Related

Data Poisoning and Backdoor Attacks on Foundation Models

Evasion Attacks on Image Classifiers: FGSM, PGD, and C&W

Adversarial Transferability: Why Black-Box Attacks Work at All

Comments