From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs
Quick Answer
This study traces an internal lifecycle of code reasoning across 16 LLMs from the Qwen, Llama, and DeepSeek families and finds that the Resolved outcome accounts for only 41.5% overall.
Quick Take
A dual diagnostic framework, pairing layer-wise linear probing with Context-Stripped Decoding (CSD), exposes task-specific failure modes: Function Call Resolved falls from 61.1% to 2.5% as call depth increases from one to three. Because similar task accuracies can mask fundamentally different failure modes, these internal dynamics reveal what surface-level evaluation cannot detect.
Key Points
- Overall Resolved is only 41.5% across six code-reasoning task families and 16 models, with multiple tasks below 30%.
- Function Call Resolved drops from 61.1% to 2.5% as call depth increases from one to three.
- The brewing scaffold remains stable, with a normalized brewing duration of 24-42% across all 16 models.
- Similar task accuracies can mask fundamentally different failure modes.
- The dual diagnostic framework pairs layer-wise linear probing with Context-Stripped Decoding (CSD).
Article Content
arXiv:2606.17648v1. Abstract: Standard accuracy metrics cannot explain why LLMs handle variable tracking but fail on semantically equivalent loops. We study an internal lifecycle of code reasoning in which models first brew the answer, making it linearly recoverable many layers before it becomes self-decodable, and then diverge into one of four resolution outcomes: Resolved, Overprocessed, Misresolved, or Unresolved.
Understanding this lifecycle matters because similar task accuracies can mask fundamentally different failure modes that surface-level evaluation cannot detect. We introduce a dual diagnostic framework pairing layer-wise linear probing with Context-Stripped Decoding (CSD) and apply it to six code-reasoning task families across 16 models spanning Qwen, Llama, and DeepSeek architectures. All four outcomes carry substantial mass in every task family: overall Resolved is only 41.5%, with multiple tasks below 30%.
Controlled sweeps over structure, depth, and operators expose task-specific failure bottlenecks: Function Call Resolved plunges from 61.1% to 2.5% as call depth increases from one to three. Across architectures and scales, the brewing scaffold remains stable, with normalized brewing duration 24-42% across all 16 models, while resolution success varies with capability.
This indicates that the scaffold is a stable empirical regularity across the tested decoder-only Transformer families, whereas resolution success covaries with capability, scale, and training. Code: https://github.com/euyis1019/llm-brewing
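To make the "normalized brewing duration" figure more concrete, here is a minimal Python sketch under an assumed definition. The abstract does not spell out the formula, so the code treats brewing duration as the gap between the first layer where a linear probe recovers the answer and the first layer where CSD decodes it, divided by the total number of layers. The function names and the per-layer boolean inputs are hypothetical and are not taken from the paper or its repository.

```python
from typing import Optional, Sequence


def first_true(flags: Sequence[bool]) -> Optional[int]:
    """Return the index of the first True entry, or None if there is none."""
    for layer, flag in enumerate(flags):
        if flag:
            return layer
    return None


def normalized_brewing_duration(
    probe_correct: Sequence[bool],
    csd_correct: Sequence[bool],
) -> Optional[float]:
    """Gap between probe onset and CSD onset as a fraction of model depth.

    ASSUMPTION (not from the paper): probe_correct[i] says whether a linear
    probe on layer i recovers the answer, csd_correct[i] says whether
    Context-Stripped Decoding from layer i yields it, and the duration is
    the layer gap between the two onsets divided by the number of layers.
    """
    if len(probe_correct) != len(csd_correct) or len(probe_correct) == 0:
        raise ValueError("expected two non-empty per-layer sequences of equal length")
    probe_onset = first_true(probe_correct)
    csd_onset = first_true(csd_correct)
    if probe_onset is None or csd_onset is None:
        # The answer never became recoverable, or never became self-decodable.
        return None
    # Clamp at zero in case CSD succeeds before the probe does.
    return max(csd_onset - probe_onset, 0) / len(probe_correct)


# Hypothetical 32-layer model: probe succeeds from layer 8, CSD from layer 18.
probe = [layer >= 8 for layer in range(32)]
csd = [layer >= 18 for layer in range(32)]
print(normalized_brewing_duration(probe, csd))  # 0.3125
```

In this made-up example the gap is 10 of 32 layers, or about 31%, which would fall inside the 24-42% band reported in the abstract. The paper's actual measurement procedure may differ.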