Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL
Quick Answer
The proposed RefGRPO method enhances agentic reinforcement learning by introducing a calibration bonus that improves reflection accuracy and task performance.
Quick Take
Compared with standard RL baselines, RefGRPO reduces the underconfidence rate from 44.4% to 7.7% and raises task accuracy from 75.1% to 76.5% on text-to-SQL across five benchmarks. The calibrated reflections also enable self-improvement without outcome supervision and more effective selective prediction at test time.
Key Points
- RefGRPO augments standard RL with a calibration bonus that is "free": it needs no additional reward model, LLM judge, or external annotation.
- The underconfidence rate fell from 44.4% to 7.7% relative to standard RL baselines.
- Task accuracy improved from 75.1% to 76.5% across five text-to-SQL benchmarks.
- Agents can use reflections as pseudo-rewards for self-improvement without supervision.
- A dynamic schedule on the calibration bonus's coefficient is the method's second ingredient.
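The underconfidence figure in the key points can be made concrete with a small sketch. The definition used here is an assumption, not taken from the paper: the share of correctly answered questions that the agent's own reflection marks as wrong. The `Rollout` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    # Hypothetical record of one agent attempt.
    correct: bool                 # actual outcome from the environment
    self_assessed_correct: bool   # the agent's reflection on its own answer


def underconfidence_rate(rollouts: List[Rollout]) -> float:
    """Assumed definition: among rollouts that were actually correct,
    the fraction the agent flagged as incorrect."""
    correct = [r for r in rollouts if r.correct]
    if not correct:
        return 0.0
    doubted = sum(1 for r in correct if not r.self_assessed_correct)
    return doubted / len(correct)


rollouts = [
    Rollout(correct=True, self_assessed_correct=True),
    Rollout(correct=True, self_assessed_correct=False),
    Rollout(correct=False, self_assessed_correct=False),
    Rollout(correct=True, self_assessed_correct=False),
]
rate = underconfidence_rate(rollouts)
```

In this toy batch three rollouts are correct and the agent doubts two of them, so the rate is two thirds. Incorrect rollouts do not enter the denominator under this assumed definition.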
Article Content
From source RSS / original summary
arXiv:2606.14211v1 Announce Type: new
Abstract: LLMs are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well-functioning agent should be able to leverage this feedback to accurately assess its own performance.
Yet we find a persistent reflection gap: LLM agents tend to mis-assess their own outputs after observing concrete environment feedback -- even for questions they correctly answered -- and standard RL barely helps due to a credit-assignment mismatch.
To close this gap, we propose RefGRPO, a simple yet effective fix that augments standard RL algorithms with two key ingredients: a free calibration bonus computed by contrasting the agent's own reflection with the actual outcome (requiring no additional reward model, LLM judge, or external annotation), and a dynamic schedule on its coefficient.
Compared to standard RL baselines, our method simultaneously improves reflection calibration (e.g., reduces underconfidence rate $44.4\% \to 7.7\%$) and task accuracy (e.g., $75.1\% \to 76.5\%$) on text-to-SQL across five benchmarks. The resulting calibrated reflection turns the agent into its own verifier grounded in environment feedback, which further enables (i) better self-improvement that uses reflections as pseudo-rewards without outcome supervision, and (ii) more effective test-time selective prediction by committing only to rollouts flagged as correct.
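The two ingredients described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the bonus formula or the schedule, so the agreement-based bonus, the linear ramp, its `start` and `end` values, and every function name below are assumptions chosen for illustration.

```python
from typing import List, Tuple


def calibration_bonus(reflected_correct: bool, actually_correct: bool) -> float:
    # Assumed form: 1.0 when the agent's reflection agrees with the
    # actual outcome, 0.0 otherwise. Only signals already available
    # during RL are used (no reward model, judge, or annotation).
    return 1.0 if reflected_correct == actually_correct else 0.0


def bonus_coefficient(step: int, total_steps: int,
                      start: float = 0.0, end: float = 0.5) -> float:
    # Hypothetical linear schedule; the paper only says the
    # coefficient follows "a dynamic schedule".
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac


def shaped_reward(task_reward: float, reflected_correct: bool,
                  actually_correct: bool, step: int, total_steps: int) -> float:
    # Task reward plus the scheduled calibration bonus.
    coeff = bonus_coefficient(step, total_steps)
    return task_reward + coeff * calibration_bonus(reflected_correct, actually_correct)


def commit_flagged(rollouts: List[Tuple[str, bool]]) -> List[str]:
    # Selective prediction: keep only answers the agent flagged as correct.
    return [answer for answer, flagged_correct in rollouts if flagged_correct]


halfway = shaped_reward(1.0, True, True, step=50, total_steps=100)
miscalibrated = shaped_reward(1.0, False, True, step=100, total_steps=100)
committed = commit_flagged([("SELECT 1", True), ("SELECT 2", False)])
```

Under these assumptions a correct, correctly self-assessed rollout halfway through training receives 1.25, while a correct rollout the agent doubts receives only the task reward of 1.0. The same reflection flag then drives test-time selection in `commit_flagged`.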