What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics
Quick Answer
The study finds that, in LLMs such as Llama, Qwen, and Gemma, the signal that separates jailbreak prompts is concentrated in intermediate layers, and that features describing how token-level entropy evolves are more informative than static statistics such as the mean or variance.
Quick Take
The signal weakens at the final layer, which suggests that harmful intent is better detected from how token-level entropy evolves inside the model than from assessments of the final output.
Key Points
- The jailbreak-relevant entropy signal is concentrated in intermediate layers and degrades at the final layer.
- Static prompt-level entropy statistics (e.g., mean, variance) carry little discriminative signal.
- Features that track how entropy evolves across token positions, such as monotonic rank-based trend scores, are substantially more informative than static aggregates.
- Findings apply across multiple models and adversarial benchmarks without extra training.
- Jailbreak behavior is reflected in structured uncertainty dynamics within the model.
Article Content
From source RSS / original summary
arXiv:2606.25182v1
Abstract: Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it remains unclear how harmful intent is encoded within the model's internal representations. We investigate this question by analyzing token-level predictive entropy trajectories across layers of a frozen LLM using the logit lens.
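As a rough illustration of what a token-level predictive entropy trajectory at one layer could look like in code, the sketch below projects a layer's hidden states through an unembedding matrix and takes the entropy of the resulting next-token distribution at each position. The function name `token_entropies`, the array shapes, and the random toy inputs are assumptions made for illustration. A real logit-lens setup would typically also apply the model's final normalization before unembedding, which this sketch omits, and the paper's exact procedure may differ.

```python
import numpy as np

def token_entropies(hidden, unembed):
    """Per-token predictive entropy (in nats) from one layer's hidden states.

    hidden:  (T, d) array, one row per token position (hypothetical input)
    unembed: (d, V) array mapping hidden states to vocabulary logits
    """
    logits = hidden @ unembed                               # (T, V)
    logits = logits - logits.max(axis=-1, keepdims=True)    # stabilise the softmax
    log_z = np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    log_p = logits - log_z                                  # log-softmax
    p = np.exp(log_p)
    return -(p * log_p).sum(axis=-1)                        # (T,)

# Toy values only: 6 token positions, hidden width 16, vocabulary of 50.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 16))
unembed = rng.normal(size=(16, 50))
trajectory = token_entropies(hidden, unembed)
print(trajectory.shape)  # (6,)
```

Repeating this for every layer's hidden states would give one entropy trajectory per layer, which is the kind of object the abstract describes analyzing.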
We find that static aggregate statistics of prompt-level entropy (e.g., mean, variance) carry little discriminative signal, whereas features capturing how entropy evolves across token positions, such as monotonic rank-based trend scores, are substantially more informative.
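The abstract does not say which rank-based trend score is used. One plausible candidate, shown here purely as an assumption, is a normalized Mann-Kendall statistic: the fraction of later-minus-earlier pairs that rise minus the fraction that fall. The sketch also shows why such a feature can carry information that the mean and variance cannot, since reordering the same values leaves both static statistics unchanged but changes the trend score. The function name `trend_score` and the toy sequences are hypothetical.

```python
import numpy as np

def trend_score(entropy):
    """Normalised Mann-Kendall statistic: +1 strictly rising, -1 strictly falling."""
    h = np.asarray(entropy, dtype=float)
    n = len(h)
    if n < 2:
        return 0.0
    diffs = h[None, :] - h[:, None]        # diffs[i, j] = h[j] - h[i]
    upper = np.triu_indices(n, k=1)        # all pairs with i < j
    return float(np.sign(diffs[upper]).sum() / (n * (n - 1) / 2))

rising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
shuffled = np.array([3.0, 1.0, 5.0, 2.0, 4.0])  # same values, different order

print(trend_score(rising))    # 1.0
print(trend_score(shuffled))  # 0.2
print(rising.mean() == shuffled.mean(), np.isclose(rising.var(), shuffled.var()))
```

The pairwise form costs O(T^2) in the number of tokens, which is acceptable for prompt-length sequences; a correlation-based alternative such as Spearman's rho against token position would serve the same illustrative purpose.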
Importantly, this signal is not uniform across model depth: it is concentrated in intermediate layers and degrades at the final layer, indicating that jailbreak-relevant structure is most pronounced in mid-network representations rather than at the output head. Across multiple models (Llama, Qwen, Gemma) and adversarial benchmarks, these entropy dynamics provide architecture-consistent separation without additional training.
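To make the idea of depth-dependent separation concrete, the sketch below scores each layer by AUROC, a training-free measure of how well a single feature separates two groups of prompts. This is an assumed evaluation choice, not necessarily the one used in the paper, and the numbers are synthetic toy values constructed so that the middle layer separates best; they are not results from the study. The names `auroc` and `separation_by_layer` are hypothetical.

```python
import numpy as np

def auroc(pos, neg):
    """Probability that a random positive scores above a random negative (ties count half)."""
    pos = np.asarray(pos, dtype=float)[:, None]
    neg = np.asarray(neg, dtype=float)[None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())

def separation_by_layer(jailbreak_scores, benign_scores):
    """Both inputs: (n_layers, n_prompts) arrays of one feature value per prompt."""
    return np.array([auroc(j, b) for j, b in zip(jailbreak_scores, benign_scores)])

# Synthetic toy feature values for 3 layers and 3 prompts per class.
jailbreak = np.array([[0.1, 0.2, 0.0],
                      [0.6, 0.8, 0.7],
                      [0.3, 0.1, 0.4]])
benign = np.array([[0.0, 0.2, 0.1],
                   [0.1, 0.2, 0.0],
                   [0.2, 0.3, 0.1]])

per_layer = separation_by_layer(jailbreak, benign)
print(int(per_layer.argmax()))  # 1, the middle layer, by construction of the toy data
```

An AUROC near 0.5 means the feature does not separate the classes at that layer, while values far below 0.5 indicate separation in the opposite direction, so the sign convention for the feature is itself an assumption here.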
Together, our findings show that jailbreak behavior is reflected in structured intermediate uncertainty dynamics, clarifying both which entropy-derived features encode harmful intent and where in the network that signal is most pronounced.

