RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents
Quick Take
RICE-PO is a critic-free policy optimization framework that converts retrieval interactions into localized learning signals for training reasoning-based retrieval agents. On BRIGHT and BEIR it outperforms prompt-based agents and group-based RL baselines under the same retriever setting, suggesting that the structure of agent-environment interaction can itself supply useful supervision.
Key Points
- RICE-PO targets the credit-assignment problem in interactive retrieval: executable actions (queries, summaries) can be scored directly by the retriever, while latent reasoning steps only affect future executable actions.
- High-uncertainty executable actions serve as anchors for localized learning signals.
- Local counterfactual branches are evaluated using retrieval metrics.
- Credit is propagated to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable.
- On BRIGHT and BEIR, RICE-PO consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting.
Article Content
Original abstract (arXiv:2605.26352v1): Retrieval is increasingly moving from one-shot matching toward interactive reasoning, where language agents iteratively inspect evidence, reformulate queries, and search again. Training such agents raises a credit-assignment challenge: executable actions such as queries or summaries can be directly evaluated by the retriever, while latent reasoning steps are not directly observable and only affect future executable actions.
This asymmetry makes outcome-level reward assignment unreliable, as the same final reward may credit reasoning steps that did not actually shape retrieval success. We propose RICE-PO, a critic-free policy optimization framework that converts retrieval interactions into localized learning signals.
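To make "localized learning signal" concrete, here is a minimal sketch of how counterfactual branches at a single action might be scored without a learned critic. It rests on assumptions the abstract does not confirm: nDCG@10 stands in for the unspecified "retrieval metrics", and the mean score across branches is used as the baseline. The function names are hypothetical and are not taken from the paper.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one ranked list; qrels maps doc id -> graded relevance."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def local_advantages(branch_rankings, qrels, k=10):
    """Score each counterfactual branch, then subtract the branch mean.

    Assumption: the mean over sibling branches acts as the baseline,
    so no learned value function (critic) is needed.
    """
    scores = [ndcg_at_k(ranking, qrels, k) for ranking in branch_rankings]
    baseline = sum(scores) / len(scores)
    return [score - baseline for score in scores]

# Two alternative queries issued from the same state, each returning a ranking.
qrels = {"d1": 1, "d2": 1}
branches = [["d1", "d2", "d3"], ["d3", "d4", "d5"]]
advantages = local_advantages(branches, qrels)
```

In this toy case the first branch retrieves both relevant documents and the second retrieves none, so the first receives a positive advantage and the second a negative one. The signal is attached to the specific action that produced each ranking, not to the whole trajectory.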
RICE-PO selects high-uncertainty executable actions as anchors, evaluates local counterfactual branches using retrieval metrics, and propagates credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable. On BRIGHT and BEIR, RICE-PO consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting.
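The anchor-selection and gated-propagation steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how uncertainty, reasoning-to-action influence, or residual stability are measured. Here uncertainty is approximated by mean token surprisal, while `influence` and `residual_var` are treated as precomputed inputs with hypothetical thresholds. Scaling the propagated credit by `influence` is likewise an assumption.

```python
from dataclasses import dataclass

@dataclass
class Step:
    kind: str              # "reasoning" (latent) or "action" (executable query/summary)
    token_logprobs: list   # log-probabilities of the sampled tokens in this step

def mean_surprisal(token_logprobs):
    """Average negative log-probability; an assumed proxy for uncertainty."""
    return -sum(token_logprobs) / len(token_logprobs)

def select_anchors(steps, top_n=1):
    """Indices of the top_n most uncertain executable actions."""
    actions = [i for i, step in enumerate(steps) if step.kind == "action"]
    actions.sort(key=lambda i: mean_surprisal(steps[i].token_logprobs), reverse=True)
    return sorted(actions[:top_n])

def propagate_credit(steps, anchor, advantage, influence, residual_var,
                     min_influence=0.5, max_residual_var=0.05):
    """Credit the anchor action; pass credit back to the reasoning steps
    immediately before it only when both gates pass (hypothetical thresholds)."""
    credit = [0.0] * len(steps)
    credit[anchor] = advantage
    gate_open = influence >= min_influence and residual_var <= max_residual_var
    if gate_open:
        i = anchor - 1
        while i >= 0 and steps[i].kind == "reasoning":
            credit[i] = advantage * influence  # assumed scaling
            i -= 1
    return credit

trajectory = [
    Step("reasoning", [-0.1]),
    Step("action", [-0.2, -0.4]),
    Step("reasoning", [-0.1]),
    Step("action", [-1.5, -2.0]),
]
anchors = select_anchors(trajectory)
gated = propagate_credit(trajectory, anchors[0], 0.4, influence=0.8, residual_var=0.01)
blocked = propagate_credit(trajectory, anchors[0], 0.4, influence=0.2, residual_var=0.01)
```

With strong influence and a stable residual, the reasoning step directly before the anchor receives a share of the advantage. With weak influence, only the anchor action itself is credited, which matches the stated goal of not rewarding reasoning that did not shape retrieval success.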
These results show that the structure of agent-environment interaction itself can provide useful supervision for training reasoning-based retrieval agents.