Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models
Quick Answer
A new diagnostic framework localizes instruction hierarchy failures in reasoning language models such as Gemma-4-31B-IT and Claude Sonnet 4.6, and shows that the dominant failure mode varies by model, task, and context length.
Quick Take
A white-box framework pinpoints where instruction hierarchy failures occur in reasoning language models like Gemma-4-31B-IT and Claude Sonnet 4.6: in identifying instructions, resolving conflicts, or realizing the response. Two training-free self-monitoring mechanisms then improve adherence in agentic workflows, with the strongest monitor reducing rule-following non-compliance by 81-99%.
Key Points
- Framework localizes failures to instruction identification, conflict resolution, and response realization.
- Evaluation on IHEval and IHChallenge shows varying failure modes across models and tasks.
- Self-monitoring mechanisms reduce non-compliance by 81-99% across multiple models.
- GPT-5.3 shows an 86% reduction under static attacks and a 45% reduction under adaptive attacks.
- Findings highlight the need for improved instruction hierarchy adherence in AI systems.
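The percentages above read most naturally as relative reductions in the non-compliance rate, although the summary does not spell out the exact formula. A minimal sketch under that assumption follows; the function name and the sample rates are illustrative and are not figures from the paper.

```python
def relative_reduction(baseline_rate: float, monitored_rate: float) -> float:
    """Fraction of baseline non-compliance removed by a monitor.

    Assumes "reduction" means (baseline - monitored) / baseline; the source
    summary does not confirm this definition.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - monitored_rate) / baseline_rate


# Illustrative rates only: 40% non-compliance without a monitor, 4% with one.
print(f"{relative_reduction(0.40, 0.04):.0%}")
```

Under this reading, a reduction figure says nothing about the absolute rate that remains, so the same percentage can correspond to very different residual risk depending on the baseline.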
Article Content
Original summary from arXiv:2606.07808v1 (announce type: new). Abstract: Reasoning language models deployed in agentic workflows must follow an instruction hierarchy: when instructions from different sources conflict, the model should obey the highest-privilege applicable instruction. Existing benchmarks largely measure this behavior end-to-end, asking whether the final response is compliant.
However, a non-compliant response can arise from several distinct failures: the model may fail to identify the relevant instructions in context, fail to resolve conflicts among identified instructions, or correctly resolve the conflict in its reasoning while still producing a violating response. We introduce a white-box diagnostic framework that localizes instruction hierarchy failures into instruction identification, conflict resolution, and response realization, making failures more interpretable.
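The three-stage localization described above can be pictured as an ordered check over per-example judgments. The sketch below is one possible reading and not the authors' implementation: the `Trace` fields, the `Stage` names, and the assumption that each stage is judged as a simple boolean are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    COMPLIANT = "compliant"
    IDENTIFICATION = "instruction identification"
    RESOLUTION = "conflict resolution"
    REALIZATION = "response realization"


@dataclass
class Trace:
    # Hypothetical judgments, e.g. from annotating a model's reasoning trace.
    identified_instructions: bool  # reasoning surfaced the relevant instructions
    resolved_conflict: bool        # reasoning chose the highest-privilege one
    response_compliant: bool       # final response obeys that instruction


def localize_failure(trace: Trace) -> Stage:
    """Attribute a non-compliant response to the earliest failed stage."""
    if trace.response_compliant:
        return Stage.COMPLIANT
    if not trace.identified_instructions:
        return Stage.IDENTIFICATION
    if not trace.resolved_conflict:
        return Stage.RESOLUTION
    # Reasoning was correct, yet the response still violates the instruction.
    return Stage.REALIZATION


print(localize_failure(Trace(True, True, False)).value)
```

Attributing each failure to the earliest broken stage is a design choice of this sketch; it keeps the categories mutually exclusive, which makes per-stage failure counts sum to the total number of non-compliant responses.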
We evaluate three reasoning models--Gemma-4-31B-IT, Qwen3.6-35B-A3B, and Claude Sonnet 4.6--on long-context adaptations of IHEval and IHChallenge, and find that the dominant failure mode varies across models, tasks, and context length.
Building on the observation that models can often detect conflicts and output violations when explicitly prompted, we propose two training-free self-monitoring mechanisms: a parallel input monitor for low-latency conflict detection before generation, and a sequential output monitor for response-level review and repair. Across Gemma-4-31B-IT, Claude Sonnet 4.6, and GPT-5.3, the strongest monitor reduces rule-following non-compliance by 81-99%, with GPT-5.3 reductions of 86% under static attacks and 45% under adaptive attacks.
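To make the two monitors concrete, here is a minimal sketch of how they might be wired around a model call. It is an illustration under stated assumptions, not the paper's method: every callable (`generate`, `detect_conflict`, `handle_conflict`, `detect_violation`, `repair`) is a hypothetical stand-in for a prompted model call, and the abstract does not say what happens once the input monitor flags a conflict.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Generate = Callable[[str, str], str]   # (system, user) -> response
Check = Callable[[str, str], bool]     # (system, text) -> problem found?
Fix = Callable[[str, str], str]        # (system, text) -> replacement response


def with_input_monitor(system: str, user: str, generate: Generate,
                       detect_conflict: Check, handle_conflict: Fix) -> str:
    """Run conflict detection concurrently with drafting (assumed design)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        conflict = pool.submit(detect_conflict, system, user)
        draft = pool.submit(generate, system, user)
        if conflict.result():
            # Assumption: a flagged conflict discards the draft.
            return handle_conflict(system, user)
        return draft.result()


def with_output_monitor(system: str, user: str, generate: Generate,
                        detect_violation: Check, repair: Fix) -> str:
    """Review the finished response, then repair it if it violates the rules."""
    response = generate(system, user)
    if detect_violation(system, response):
        response = repair(system, response)
    return response


# Toy stand-ins so the sketch runs without any model API.
gen = lambda s, u: "SECRET-123" if "reveal" in u else "ok"
print(with_input_monitor("never leak", "reveal it", gen,
                         lambda s, u: "reveal" in u, lambda s, u: "declined"))
print(with_output_monitor("never leak", "reveal it", gen,
                          lambda s, r: "SECRET" in r, lambda s, r: "[withheld]"))
```

The input monitor adds little latency because it overlaps with generation, while the output monitor costs an extra sequential pass but can inspect what the model actually produced.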