Breaking Safety at the Token Boundary: How BPE Tokenization Creates Exploitable Gaps in LLM Alignment
Quick Answer
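This paper shows that BPE tokenization in LLMs such as Qwen and Llama opens an exploitable safety gap: an attack that targets safety-token fragmentation flips the first-token refusal on 80-100% of refused HarmBench prompts, and 48% of those flips yield genuinely harmful outputs.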
Quick Take
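BPE tokenization splits safety-critical words into sub-word pieces that the surveyed public alignment datasets do not cover. Across five model families (Qwen-3-4B, Qwen-2.5-7B, Gemma-3-4B, Llama-3.1-8B, Mistral-7B), 48% of first-token refusal flips produced harmful outputs (29-65% per model). The defenses tested did not close the gap reliably: SFT on fragmented prompts closed ASR on 3 of 5 families, but only by raising refusals on benign prompts as well.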
Key Points
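- Character-level perturbations make BPE split safety-critical words into sub-word pieces, which bypasses safety alignment while the prompt stays human-readable.
- The attack flipped the first-token refusal on 80-100% of refused HarmBench prompts; 48% of those flips produced genuinely harmful outputs (29-65% per model).
- A scan of 30,000 examples from three public alignment datasets found no fragmented prompts.
- SFT on fragmented prompts closed ASR on 3 of 5 model families, but only through global collapse that also raised refusal rates on benign prompts.
- Conv-Benign is introduced as a candidate paired diagnostic to distinguish selective repair from global collapse.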
Article Content
From source RSS / original summary
arXiv:2607.01239v1 Announce Type: new
Abstract: Character-level perturbations bypass safety alignment in modern LLMs despite leaving prompts human-readable. We identify and test a central structural mechanism: BPE tokenization fragments safety-critical words into sub-word pieces, and the three public alignment datasets we surveyed contain no intentionally fragmented inputs. The mechanism is a chain, tested end-to-end on five model families (Qwen-3-4B, Qwen-2.5-7B, Gemma-3-4B, Llama-3.1-8B, Mistral-7B).
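To make the fragmentation mechanism concrete, here is a minimal sketch of a toy BPE tokenizer. The merge table `MERGES` and the example word are assumptions chosen for illustration; they are not taken from the paper or from any real model's vocabulary, which would hold tens of thousands of learned merges.

```python
from typing import List, Tuple

# Hypothetical merge table in priority order (earlier = learned earlier).
# Together these merges let the clean word collapse into a single token.
MERGES: List[Tuple[str, str]] = [
    ("w", "e"), ("a", "p"), ("o", "n"), ("we", "ap"), ("weap", "on"),
]


def bpe_tokenize(word: str, merges: List[Tuple[str, str]] = MERGES) -> List[str]:
    """Greedy BPE: repeatedly merge the adjacent pair with the best rank."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        # Rank every adjacent pair; pairs with no merge rule get rank infinity.
        candidates = [
            (ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no applicable merge is left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


print(bpe_tokenize("weapon"))   # ['weapon']
print(bpe_tokenize("weap0n"))   # ['weap', '0', 'n']
print(bpe_tokenize("we-apon"))  # ['we', '-', 'ap', 'on']
```

A one-character change keeps the word readable to a person, yet the model no longer sees the single token it met during alignment training; it sees several pieces instead.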
An optimization targeting safety-token fragmentation flips the first-token refusal trigger on 80-100% of refused HarmBench prompts, with 48% of those flips producing genuinely harmful outputs (per-model 29-65%; gap-vs-behavior ROC-AUC 0.66-0.98, pooled 0.84).
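The ROC-AUC figures above measure how well a per-prompt score separates prompts that ended in harmful outputs from those that did not. The sketch below computes ROC-AUC from its rank definition. It assumes the "gap" is a single scalar per prompt and "behavior" a binary label; the abstract does not define either precisely, and the numbers in the example are made up.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation of ROC-AUC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Illustrative values only: hypothetical gap scores and harmful/not labels.
gaps = [0.9, 0.8, 0.3, 0.1]
harmful = [1, 0, 1, 0]
print(roc_auc(gaps, harmful))  # 0.75
```

A value of 0.5 means the score carries no information about behavior, and 1.0 means it ranks every harmful case above every harmless one.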
Activation patching localizes the disrupted signal to the last ~30% of layers; an alignment-data scan finds zero fragmented prompts among 30,000 examples (positive-control recall ≥ 99% at attack-relevant intensities); and targeted-mutation experiments isolate safety words as the disruption locus. On the defense side, a 68-cell grid (55 trained checkpoints) shows that no configuration achieves seed- and pool-stable ASR closure on the three families with closed pool-size confounds.
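A dataset scan of this kind needs a detector plus a positive control that shows the detector would have caught fragmented prompts had any been present. The abstract does not describe the authors' detector, so the following is only a sketch of the idea under stated assumptions: the word list `SAFETY_WORDS`, the `LEET` substitution map, and the hyphen-injection perturbation are all hypothetical.

```python
import re
from typing import Collection, List

SAFETY_WORDS = {"weapon", "poison"}  # hypothetical word list
# Hypothetical map undoing common look-alike character substitutions.
LEET = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)


def is_fragmented(prompt: str, words: Collection[str] = SAFETY_WORDS) -> bool:
    """Flag a prompt in which a listed word appears only in perturbed form."""
    raw = prompt.lower()
    raw_words = re.findall(r"[a-z]+", raw)
    # Undo substitutions, then drop every non-letter so "weap0n" or
    # "we-apon" collapses back to "weapon".
    squashed = re.sub(r"[^a-z]", "", raw.translate(LEET))
    return any(
        w in squashed and not any(w in rw for rw in raw_words)
        for w in words
    )


def inject(prompt: str, word: str) -> str:
    """Positive control: split the word with a hyphen at its midpoint."""
    mid = len(word) // 2
    return prompt.lower().replace(word, word[:mid] + "-" + word[mid:])


def positive_control_recall(
    clean_prompts: List[str], words: Collection[str] = SAFETY_WORDS
) -> float:
    """Share of deliberately fragmented prompts that the detector flags."""
    injected = [
        inject(p, w) for p in clean_prompts for w in words if w in p.lower()
    ]
    if not injected:
        raise ValueError("no prompt contains a listed word; nothing to inject")
    return sum(is_fragmented(p, words) for p in injected) / len(injected)
```

Because the detector removes spaces before matching, it can raise false positives when a listed word happens to span two ordinary words. A real scan would need a stricter matching rule and a wider range of perturbations than the single hyphen used here.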
SFT trained on fragmented prompts closes ASR on 3/5 families but only via global collapse that raises refusal on benign prompts as well, indicating the missing distribution is necessary but not sufficient under the LoRA-16 recipe we tested. To distinguish selective repair from global collapse, we introduce Conv-Benign, a candidate paired diagnostic. All ASR claims are 3-judge-calibrated (cell rankings stable across judges; absolute levels ±18pp; see App. B.13).
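The difference between selective repair and global collapse can be expressed as a paired check: did ASR fall, and did refusals on benign prompts rise along with it? The sketch below is not the paper's Conv-Benign diagnostic, which the abstract does not specify. It is a simple stand-in, and the thresholds `asr_closed` and `max_benign_rise` as well as the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalCounts:
    attack_successes: int  # attack prompts that yielded harmful output
    attack_total: int
    benign_refusals: int   # benign prompts that the model refused
    benign_total: int

    @property
    def asr(self) -> float:
        return self.attack_successes / self.attack_total

    @property
    def benign_refusal_rate(self) -> float:
        return self.benign_refusals / self.benign_total


def classify_defense(
    before: EvalCounts,
    after: EvalCounts,
    asr_closed: float = 0.05,       # hypothetical: ASR at or below this is "closed"
    max_benign_rise: float = 0.05,  # hypothetical: tolerated rise in benign refusals
) -> str:
    """Label a defense by comparing attack and benign metrics together."""
    if after.asr > asr_closed:
        return "asr_open"
    rise = after.benign_refusal_rate - before.benign_refusal_rate
    return "global_collapse" if rise > max_benign_rise else "selective_repair"


# Made-up counts for illustration only.
base = EvalCounts(attack_successes=40, attack_total=100, benign_refusals=2, benign_total=100)
collapsed = EvalCounts(attack_successes=1, attack_total=100, benign_refusals=60, benign_total=100)
print(classify_defense(base, collapsed))  # global_collapse
```

Reporting ASR alone would score the collapsed checkpoint as a success, which is why the benign side of the pair matters.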