RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

arXiv cs.CL·Mingchen Li, Hansi Zeng, Zhuo Qian, Jiatan Huang, Hamed Zamani, Hong Yu

5/27/2026

Quick Answer

RICE-PO is a critic-free policy optimization framework that converts retrieval interactions into localized learning signals for training reasoning-based retrieval agents.

Quick Take

RICE-PO is a critic-free policy optimization framework that converts retrieval interactions into localized learning signals. On BRIGHT and BEIR, it consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting, which suggests that the structure of agent-environment interaction can itself provide supervision for training reasoning-based retrieval agents.

Key Points

RICE-PO targets the credit-assignment problem in interactive retrieval: executable actions (queries, summaries) can be scored directly by the retriever, but latent reasoning steps cannot.
It selects high-uncertainty executable actions as anchors for localized learning signals.
It evaluates local counterfactual branches at those anchors using retrieval metrics.
It propagates credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable.
On BRIGHT and BEIR, it consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting.
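
To make the anchor idea concrete, here is a minimal sketch of picking the most uncertain executable action in a trajectory. The summary does not say how uncertainty is measured, so mean token entropy is an assumption, and the names `token_entropy`, `action_uncertainty`, `select_anchors`, and `top_k` are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def action_uncertainty(token_dists):
    """Mean per-token entropy over the tokens of one executable action.
    Assumption: the source does not define its uncertainty measure."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)

def select_anchors(actions, top_k=1):
    """Return indices of the top_k most uncertain actions, most uncertain first."""
    scored = [(action_uncertainty(dists), i) for i, dists in enumerate(actions)]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:top_k]]

# Toy trajectory: three executable actions, each a list of token distributions.
trajectory_actions = [
    [[0.9, 0.1], [0.8, 0.2]],   # fairly confident query
    [[0.5, 0.5], [0.6, 0.4]],   # uncertain query
    [[0.99, 0.01]],             # very confident summary
]
anchors = select_anchors(trajectory_actions, top_k=1)
```

In this toy trajectory the second action has the highest mean entropy, so it is the one that would be chosen as the anchor.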

Article Content

From source RSS / original summary

arXiv:2605.26352v1. Abstract: Retrieval is increasingly moving from one-shot matching toward interactive reasoning, where language agents iteratively inspect evidence, reformulate queries, and search again. Training such agents raises a credit-assignment challenge: executable actions such as queries or summaries can be directly evaluated by the retriever, while latent reasoning steps are not directly observable and only affect future executable actions.
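
As an illustration of what "directly evaluated by the retriever" can mean, the sketch below scores the ranked list returned for one query with nDCG@k. The abstract only says "retrieval metrics"; nDCG@10 is commonly reported on BEIR and BRIGHT, so its use here is an assumption, as are the function names and the toy relevance judgments.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain over the top-k positions."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k of a ranked list against graded relevance judgments (qrels)."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical judgments and two ranked lists, e.g. for an original query
# and for a reformulated query issued by the agent.
qrels = {"d1": 1, "d2": 1}
score_original = ndcg_at_k(["d3", "d1", "d4"], qrels)
score_reformulated = ndcg_at_k(["d1", "d2", "d3"], qrels)
```

A reformulated query that moves both relevant documents to the top gets a higher score than the original, which is the kind of per-action signal a latent reasoning step never receives on its own.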

This asymmetry makes outcome-level reward assignment unreliable, as the same final reward may credit reasoning steps that did not actually shape retrieval success. We propose RICE-PO, a critic-free policy optimization framework that converts retrieval interactions into localized learning signals.
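
The problem with outcome-level credit can be seen in a small sketch. It assumes a GRPO-style group baseline, where each trajectory's final reward is normalized against the group and the result is broadcast to every step. The abstract does not specify the group-based baselines, so this formulation and all names are illustrative only.

```python
from statistics import mean, pstdev

def group_relative_advantages(final_rewards, eps=1e-8):
    """Normalize each trajectory's final reward against its sampled group."""
    mu = mean(final_rewards)
    sigma = pstdev(final_rewards)
    return [(r - mu) / (sigma + eps) for r in final_rewards]

def broadcast_to_steps(advantage, num_steps):
    """Outcome-level credit: every step in the trajectory gets the same value."""
    return [advantage] * num_steps

# Hypothetical final retrieval rewards for four sampled trajectories.
rewards = [0.8, 0.4, 0.4, 0.0]
advantages = group_relative_advantages(rewards)

# All five steps of the best trajectory receive identical credit, whether or
# not a given reasoning step actually shaped the successful query.
step_credit = broadcast_to_steps(advantages[0], num_steps=5)
```

Because the per-step values are identical by construction, a reasoning step that had no effect on the final query is rewarded exactly as much as the one that produced it.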

RICE-PO selects high-uncertainty executable actions as anchors, evaluates local counterfactual branches using retrieval metrics, and propagates credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable.

On BRIGHT and BEIR, RICE-PO consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting.
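
The branch evaluation and gated propagation described here can be sketched as follows. This is not the paper's formulation, which the abstract does not give: the mean-of-siblings baseline, the scalar `influence` score, the use of score spread as a stability test, and both thresholds are assumptions made for illustration.

```python
from statistics import mean, pstdev

def local_branch_advantages(branch_scores):
    """Critic-free local baseline: each branch's retrieval score minus the
    mean over the sibling branches sampled at the same anchor."""
    baseline = mean(branch_scores)
    return [score - baseline for score in branch_scores]

def propagate_credit(action_advantage, influence, residual_scores,
                     min_influence=0.5, max_residual_std=0.1):
    """Pass the anchor's advantage to the preceding reasoning step only if
    both gates hold; otherwise the reasoning step gets zero credit.
    influence: hypothetical reasoning-to-action influence score in [0, 1].
    residual_scores: hypothetical retrieval scores of continuations after the anchor."""
    strong = influence >= min_influence
    stable = pstdev(residual_scores) <= max_residual_std
    return action_advantage if (strong and stable) else 0.0

# Three alternative queries sampled at one anchor, scored by a retrieval metric.
branch_scores = [0.62, 0.40, 0.30]
advantages = local_branch_advantages(branch_scores)

credited = propagate_credit(advantages[0], influence=0.8,
                            residual_scores=[0.60, 0.65, 0.58])
weak_influence = propagate_credit(advantages[0], influence=0.2,
                                  residual_scores=[0.60, 0.65, 0.58])
unstable_future = propagate_credit(advantages[0], influence=0.8,
                                   residual_scores=[0.1, 0.9, 0.5])
```

Only the first call passes credit back to the reasoning step. The other two return zero because one gate fails, mirroring the "only when" condition in the abstract.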

These results show that the structure of agent-environment interaction itself can provide useful supervision for training reasoning-based retrieval agents.
