Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models

arXiv cs.AI·Sanjay Kariyappa, G. Edward Suh

6/9/2026

·~2 min·6/9/2026·en·2

Quick Answer

This paper shows that A new framework identifies failures in reasoning language models like Gemma-4-31B-IT and Claude Sonnet 4.6, revealing that dominant failure modes vary by model and context.

Quick Take

Self-monitoring mechanisms significantly reduce non-compliance by up to 99%, enhancing instruction adherence in AI workflows.

Key Points

Framework localizes failures to instruction identification, conflict resolution, and response realization.
Evaluation on IHEval and IHChallenge shows varying failure modes across models and tasks.
Self-monitoring mechanisms reduce non-compliance by 81-99% across multiple models.
GPT-5.3 shows 86% reduction under static attacks and 45% under adaptive attacks.
Findings highlight the need for improved instruction hierarchy adherence in AI systems.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Source Excerpt

arXiv:2606. 07808v1 Announce Type: new Abstract: Reasoning language models deployed in agentic workflows must follow an instruction hierarchy: when instructions from different sources conflict, the model should obey the highest-privilege applicable instruction. Existing benchmarks largely measure this behavior end-to-end, asking whether the final response is compliant.

However, a non-compliant response can arise from several distinct failures: the model may fail to identify the relevant instructions in context, fail to resolve conflicts among identified instructions, or correctly resolve the conflict in its reasoning while still producing a violating response. …

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.AI

See more →

arXiv cs.AI·Vinil Pasupuleti, Shyalendar Reddy Allala, Siva Rama Krishna Varma Bayyavarapu, Shrey Tyagi, Srinivasateja Songa

3h ago

FeaturedOriginal

AINTMA: Agentic AI Architecture for Autonomous Test Management with Generative Intelligence, Secure Cloud Communication and Adaptive Quality Analytics

AI Summary

AINTMA, an autonomous test management architecture utilizing six specialized AI agents, achieves 88.4% test prioritization accuracy and reduces defect escape rates from 8.3% to 2.1%. The system demonstrates a 340% ROI within nine months, showcasing the potential of agentic AI in enhancing software quality management in cloud environments.

#Agent #AI Coding #Security #Enterprise AI

Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.AI

AINTMA: Agentic AI Architecture for Autonomous Test Management with Generative Intelligence, Secure Cloud Communication and Adaptive Quality Analytics

RAIL Guard: Closing the Evaluation-to-Remediation Gap in Responsible AI for Agents

Automatic Ordinary Differential Equations Discovery For Biological Systems Using Powered Agentic System

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.AI

AINTMA: Agentic AI Architecture for Autonomous Test Management with Generative Intelligence, Secure Cloud Communication and Adaptive Quality Analytics

RAIL Guard: Closing the Evaluation-to-Remediation Gap in Responsible AI for LLM Agents

Automatic Ordinary Differential Equations Discovery For Biological Systems Using Large Language Model Powered Agentic System

RAIL Guard: Closing the Evaluation-to-Remediation Gap in Responsible AI for Agents

Automatic Ordinary Differential Equations Discovery For Biological Systems Using Powered Agentic System