ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

arXiv cs.AI·Jianbo Lin, Xiaomin Yu, Yi Xin, Yifu Guo, Zhuosong Jiang, Zhongqi Yue, Weishi Wang, Heqing Zou, Chengwei Qin, Hui Xiong

5/18/2026

Quick Answer

The ICRL framework enables large language models to internalize self-critique using reinforcement learning, resulting in significant performance improvements on agentic and mathematical tasks.

Quick Take

Evaluated with Qwen3-4B and Qwen3-8B as backbones, ICRL achieves average gains of 6.4 points over GRPO on agentic tasks and 7.0 points on mathematical reasoning. The learned 8B critic is comparable to 32B critics while using substantially fewer tokens.

Key Points

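ICRL jointly trains a solver and a critic from a shared backbone so that critique-induced success becomes unassisted solver ability.
It introduces a distribution-calibration re-weighting ratio that transfers only the critique-guided improvements compatible with the solver's own prompt distribution.
It achieves average gains of 6.4 points over GRPO on agentic tasks and 7.0 points on mathematical reasoning.
The learned 8B critic is comparable to 32B critics while using substantially fewer tokens.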
Code for ICRL is publicly available for further research.
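
To make the re-weighting idea in the key points concrete, here is a minimal sketch, not the paper's formula. It assumes the ratio compares how likely a response is under the critique-free prompt versus the critique-conditioned prompt, and that it is clipped from above. The names `calibration_weight`, `reweighted_advantage`, and `max_weight` are hypothetical and do not come from the ICRL code.

```python
import math


def calibration_weight(logp_without_critique, logp_with_critique, max_weight=1.0):
    """Assumed ratio pi(y | x) / pi(y | x, critique), clipped at max_weight.

    Inputs are sequence-level log-probabilities of the same response y.
    Clipping in log space avoids overflow for very large ratios.
    """
    log_ratio = logp_without_critique - logp_with_critique
    return math.exp(min(log_ratio, math.log(max_weight)))


def reweighted_advantage(advantage, logp_without_critique, logp_with_critique):
    """Scale an advantage by the (assumed) calibration weight."""
    return calibration_weight(logp_without_critique, logp_with_critique) * advantage


# A response the solver could plausibly produce on its own keeps most of its weight.
w_compatible = calibration_weight(-10.0, -9.5)

# A response that is only likely when the critique is in the prompt is nearly dropped.
w_dependent = calibration_weight(-30.0, -9.5)

# A response more likely without the critique is capped at max_weight.
w_capped = calibration_weight(-5.0, -9.0)
```

Under this assumption, critique-guided successes that the solver could plausibly reach without help dominate the update, while behavior that only appears when the critique is present contributes little.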

[Submitted on 13 May 2026]

Abstract: Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks, and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at this https URL.
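
The abstract's critic reward and role-wise group advantage can be illustrated with a toy sketch. This is an assumption-laden reading, not the authors' implementation. It assumes the critic's reward is the solver's score after a critique minus its score before, and that advantages are standardized GRPO-style within each role's own group. All function names here are hypothetical.

```python
from statistics import mean, pstdev


def critic_rewards(score_before, scores_after):
    """Assumed critic reward: solver score after each critique minus the score before."""
    return [after - score_before for after in scores_after]


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def role_wise_advantages(solver_rewards, critic_rewards_, eps=1e-6):
    """Normalize each role against its own group so the two reward scales do not mix."""
    return {
        "solver": group_advantages(solver_rewards, eps),
        "critic": group_advantages(critic_rewards_, eps),
    }


# Toy group of four rollouts per role for a single query.
solver_r = [0.0, 1.0, 0.0, 1.0]  # task success of critique-free attempts
critic_r = critic_rewards(score_before=0.0, scores_after=[1.0, 0.0, 1.0, 1.0])
adv = role_wise_advantages(solver_r, critic_r)
```

Keeping the two groups separate means a critique is compared only against other critiques for the same query, and a solver attempt only against other solver attempts, which is one plausible way to keep joint optimization stable.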

Subjects:	Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Cite as:	arXiv:2605.15224 [cs.AI]
	(or arXiv:2605.15224v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.15224 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jianbo Lin
[v1] Wed, 13 May 2026 08:50:05 UTC (973 KB)

— Originally published at arxiv.org
