Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward
Quick Take
DecomposeR is a planner-centric framework for deep research tasks that represents research plans as typed directed acyclic graphs (DAGs), making planning explicit and rewardable. DecomposeR-8B, trained from Qwen3-8B, improves over strong comparable open baselines by 5.1-8.0 points on long-form benchmarks, a gain the authors attribute to improved planning and answering.
Key Points
- DecomposeR uses typed DAGs for explicit and structured research planning.
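- Training runs in two stages: planner RL first, then answerer RL conditioned on the learned plan.
- DecomposeR-8B (trained from Qwen3-8B) outperforms strong comparable open baselines by 5.1-8.0 points on long-form benchmarks.
- Rewards are assigned to explicit planner tokens and structured plan components rather than to a flat trajectory, which allows finer-grained optimization of planning.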
- The approach reduces ambiguity in end-to-end training.
Article Content
From source RSS / original summary: arXiv:2605.30824v1, Announce Type: new
Abstract: Deep research tasks require LLMs to plan what to investigate, retrieve evidence, and synthesize long-form answers across multiple branches of inquiry. Existing training paradigms either rely on short-form verifiable QA as a proxy or optimize monolithic long trajectories, which makes planning and execution difficult to disentangle and yields weak credit assignment for the planning process.
We propose DecomposeR, a planner-centric deep research framework that represents research plans as typed directed acyclic graphs (DAGs), allowing planning to be made explicit, structured, and rewardable. We train a Qwen3-8B model in two stages: planner reinforcement learning (RL) first learns graph structure and query decomposition to improve research planning, and answerer reinforcement learning (RL) then learns branch-level execution and final synthesis conditioned on the learned plan.
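To make the idea of a typed DAG plan concrete, here is a minimal Python sketch of one way such a plan could be represented and checked. It is an illustration under stated assumptions, not the paper's implementation: the node types in `NODE_TYPES`, the `PlanNode` fields, and the `execution_order` helper are all hypothetical, since the abstract does not specify the type vocabulary or the plan schema.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter

# Hypothetical type vocabulary; the abstract does not list the real node types.
NODE_TYPES = {"search", "compare", "synthesize"}


@dataclass
class PlanNode:
    node_id: str
    node_type: str  # must be one of NODE_TYPES
    query: str      # the sub-question this node investigates
    depends_on: list[str] = field(default_factory=list)


def execution_order(nodes: list[PlanNode]) -> list[str]:
    """Validate a plan and return node ids in a dependency-respecting order."""
    by_id = {n.node_id: n for n in nodes}
    if len(by_id) != len(nodes):
        raise ValueError("duplicate node id")
    for n in nodes:
        if n.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {n.node_type}")
        for dep in n.depends_on:
            if dep not in by_id:
                raise ValueError(f"unknown dependency: {dep}")
    # TopologicalSorter expects a mapping from each node to its predecessors.
    sorter = TopologicalSorter({n.node_id: n.depends_on for n in nodes})
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        raise ValueError("plan contains a cycle") from exc
```

Because every node carries a type and explicit dependencies, a malformed plan (unknown type, dangling edge, cycle) can be rejected mechanically, which is the property that makes a plan of this kind checkable and therefore rewardable.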
By assigning rewards to explicit planner tokens and structured components rather than to a flat trajectory, DecomposeR enables finer-grained optimization of planning while reducing the ambiguity of end-to-end training. Experiments show that DecomposeR-8B improves over strong comparable open baselines by 5.1-8.0 points on popular long-form benchmarks due to improved planning and answering capabilities.
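The contrast between component-level and flat-trajectory rewards can be sketched in a few lines of Python. This is only an illustration of the general idea: the `Span` type, the component labels (`"structure"`, `"query"`), and both functions are assumptions introduced here, and the abstract does not describe how DecomposeR actually computes or distributes its rewards.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: int      # inclusive token index
    end: int        # exclusive token index
    component: str  # hypothetical label, e.g. "structure" or "query"


def structured_token_rewards(
    num_tokens: int,
    spans: list[Span],
    component_rewards: dict[str, float],
) -> list[float]:
    """Give each token the reward of the plan component it belongs to.

    Tokens outside every span (for example, non-planner tokens) get 0.0.
    """
    rewards = [0.0] * num_tokens
    for span in spans:
        r = component_rewards[span.component]
        for i in range(span.start, span.end):
            rewards[i] = r
    return rewards


def flat_token_rewards(num_tokens: int, trajectory_reward: float) -> list[float]:
    """Baseline for comparison: every token shares one trajectory-level reward."""
    return [trajectory_reward] * num_tokens
```

With the flat scheme, a good graph structure and a poor query decomposition receive the same signal; with the span-based scheme, each component's tokens are credited separately, which is the kind of finer-grained credit assignment the abstract describes.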