Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

arXiv cs.AI·Mustafa Anis Hussain, Xinle Wu, Yao Lu

6/1/2026


Quick Answer

DecomposeR is a planner-centric framework for deep research tasks that represents research plans as typed directed acyclic graphs (DAGs), making planning explicit, structured, and rewardable.

Quick Take

DecomposeR-8B, trained from Qwen3-8B in two stages (planner RL, then answerer RL), improves over strong comparable open baselines by 5.1-8.0 points on long-form benchmarks. Rewards are assigned to explicit planner tokens and structured plan components rather than to a flat trajectory.

Key Points

DecomposeR uses typed DAGs for explicit and structured research planning.
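Training has two stages: planner RL learns graph structure and query decomposition, then answerer RL learns branch-level execution and final synthesis conditioned on the learned plan.
DecomposeR-8B, trained from Qwen3-8B, improves over strong comparable open baselines by 5.1-8.0 points on long-form benchmarks.
Rewards are assigned to explicit planner tokens and structured plan components rather than to a flat trajectory, which allows finer-grained optimization of planning.
The approach reduces the ambiguity of end-to-end training, where planning and execution are hard to disentangle.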


Article Content

From source RSS / original summary

arXiv:2605.30824v1. Abstract: Deep research tasks require LLMs to plan what to investigate, retrieve evidence, and synthesize long-form answers across multiple branches of inquiry. Existing training paradigms either rely on short-form verifiable QA as a proxy or optimize monolithic long trajectories, which makes planning and execution difficult to disentangle and yields weak credit assignment for the planning process.

We propose DecomposeR, a planner-centric deep research framework that represents research plans as typed directed acyclic graphs (DAGs), allowing planning to be made explicit, structured, and rewardable. We train a Qwen3-8B model in two stages: planner reinforcement learning (RL) first learns graph structure and query decomposition to improve research planning, and answerer reinforcement learning (RL) then learns branch-level execution and final synthesis conditioned on the learned plan.
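To make the idea of a typed DAG plan concrete, here is a minimal sketch of how such a plan could be represented and checked. The abstract does not list the node types or the plan schema, so `NODE_TYPES`, `PlanNode`, and `validate_plan` are hypothetical names chosen for illustration, not the paper's format.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter

# Hypothetical type vocabulary; the abstract does not say which types are used.
NODE_TYPES = {"question", "search", "synthesize"}


@dataclass
class PlanNode:
    node_id: str
    node_type: str
    query: str
    depends_on: list[str] = field(default_factory=list)


def validate_plan(nodes: list[PlanNode]) -> list[str]:
    """Return an execution order, or raise ValueError if the plan is not a valid typed DAG."""
    by_id = {n.node_id: n for n in nodes}
    if len(by_id) != len(nodes):
        raise ValueError("duplicate node ids")
    for n in nodes:
        if n.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {n.node_type}")
        for dep in n.depends_on:
            if dep not in by_id:
                raise ValueError(f"unknown dependency: {dep}")
    # Map each node to the set of nodes it depends on (its predecessors).
    graph = {n.node_id: set(n.depends_on) for n in nodes}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("plan contains a cycle") from exc
```

Because every check here is mechanical (known types, resolvable dependencies, no cycles), a plan in this form can be scored on its structure before any branch is executed, which is one plausible reading of "rewardable" planning.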

By assigning rewards to explicit planner tokens and structured components rather than to a flat trajectory, DecomposeR enables finer-grained optimization of planning while reducing the ambiguity of end-to-end training. Experiments show that DecomposeR-8B improves over strong comparable open baselines by 5.1-8.0 points on popular long-form benchmarks due to improved planning and answering capabilities.
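The contrast between a flat trajectory reward and component-level rewards can be illustrated with a small sketch. The abstract does not specify the reward functions or how tokens are tagged, so the component names (`"structure"`, `"query"`) and both helper functions are assumptions made for illustration only.

```python
from typing import Optional


def flat_rewards(num_tokens: int, trajectory_reward: float) -> list[float]:
    """Flat credit assignment: every token receives the same trajectory-level scalar."""
    return [trajectory_reward] * num_tokens


def component_rewards(
    token_components: list[Optional[str]],
    rewards: dict[str, float],
) -> list[float]:
    """Structured credit assignment: each token receives the reward of the plan
    component it belongs to; tokens outside any scored component receive 0.0."""
    return [
        rewards.get(component, 0.0) if component is not None else 0.0
        for component in token_components
    ]


# Hypothetical tagging: two graph-structure tokens, one query token, one untagged token.
tags = ["structure", "structure", "query", None]
per_token = component_rewards(tags, {"structure": 1.0, "query": 0.5})
```

Under the flat scheme, a good graph structure paired with a poor query would receive one blended score across all tokens. Under the component scheme, each part of the plan is credited separately.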

