RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents
Quick Answer
This paper shows that RODS (Reward-driven Online Data Synthesis) addresses the depletion of informative samples in multi-turn tool-use reinforcement learning by synthesizing new data based on reward variance.
Quick Take
RODS (Reward-driven Online Data Synthesis) addresses the depletion of informative samples in multi-turn reinforcement learning by synthesizing new data around samples with high rollout reward variance. Starting from 400 human seeds and keeping an active training pool of about 800 samples, it achieves performance comparable to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and its replay buffer co-evolves with the policy.
Key Points
- RODS synthesizes new samples using reward variance as a boundary detector.
- It maintains an active training pool of approximately 800 samples.
- RODS achieves similar performance to a 17K-sample offline pipeline.
- It requires roughly 20x fewer trajectories than the 17K-sample offline pipeline.
- A dynamic replay buffer co-evolves with the policy during training.
Article Content
From source RSS / original summary
arXiv:2606.19047v1 Announce Type: new
Abstract: Multi-turn RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of the Popoviciu upper bound. Consequently, samples near the agent's capability boundary -- where successes and failures are roughly balanced -- contribute disproportionately large policy gradients.
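The variance claim can be checked numerically. The sketch below is an illustration, not the paper's code: it assumes rewards bounded in [0, 1], and the helper names and rollout groups are hypothetical. Popoviciu's inequality bounds the variance of any variable confined to [a, b] by (b - a)^2 / 4, and a group with balanced successes and failures reaches that bound.

```python
from statistics import pvariance

def popoviciu_bound(low: float, high: float) -> float:
    """Upper bound on the variance of any random variable confined to [low, high]."""
    return (high - low) ** 2 / 4.0

def reward_variance(rewards: list[float]) -> float:
    """Population variance of the rewards from one task's rollout group."""
    return pvariance(rewards)

# Hypothetical groups of 8 rollouts each, with binary rewards.
groups = {
    "too_easy": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "boundary": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "too_hard": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
}
variances = {name: reward_variance(r) for name, r in groups.items()}
bound = popoviciu_bound(0.0, 1.0)  # 0.25 for rewards in [0, 1]
```

In this toy data only the balanced group attains the bound, while the group that always succeeds has zero variance.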
As training progresses, this boundary continuously shifts, which gradually depletes the pool of informative samples in a static dataset. We propose RODS (Reward-driven Online Data Synthesis) to resolve this depletion. RODS closes the loop between RL training and data generation by repurposing the progress reward variance as a practical, zero-cost boundary detector that requires no extra inference beyond the rollouts already computed for training.
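A minimal sketch of how reward variance could act as a boundary detector, assuming per-task progress rewards in [0, 1] are already available from the training rollouts. The function name `find_boundary_tasks`, the threshold value, and the task IDs are hypothetical and do not come from the paper.

```python
from statistics import pvariance

def find_boundary_tasks(rollout_rewards: dict[str, list[float]],
                        min_variance: float = 0.15) -> list[str]:
    """Return task IDs whose rollout reward variance meets the threshold, highest first.

    Only rewards already produced by training rollouts are used, so no extra inference runs.
    """
    scored = [(pvariance(rewards), task_id)
              for task_id, rewards in rollout_rewards.items()
              if len(rewards) > 1]
    selected = [(var, task_id) for var, task_id in scored if var >= min_variance]
    selected.sort(key=lambda pair: pair[0], reverse=True)
    return [task_id for _, task_id in selected]

# Hypothetical rewards from four rollouts per task.
rollouts = {
    "book_flight": [1.0, 1.0, 1.0, 1.0],      # always solved: variance 0
    "refund_order": [1.0, 0.0, 1.0, 0.0],     # balanced outcomes
    "merge_calendars": [0.0, 0.0, 0.0, 1.0],  # mostly failing
}
boundary = find_boundary_tasks(rollouts)
```

The threshold is a tuning choice in this sketch; a ranking or top-k cut would serve the same purpose.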
It continuously identifies such boundary samples, synthesizes new multi-turn variants matching their structural complexity (e.g., API topology and dependency depth) via a skill-aligned resampling pipeline, and manages a dynamic replay buffer that co-evolves with the policy.
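One way to picture a buffer that co-evolves with the policy is a fixed-capacity pool that evicts the least informative samples as new variants arrive. The sketch below is an assumption-laden illustration, not the paper's design: the `Sample` and `ReplayBuffer` classes, the eviction rule, and the initial variance given to a new variant are all hypothetical, and the synthesis step itself is not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    task_id: str
    variance: float = 0.0  # latest rollout reward variance observed for this sample

@dataclass
class ReplayBuffer:
    capacity: int
    samples: dict[str, Sample] = field(default_factory=dict)

    def add(self, sample: Sample) -> None:
        """Insert or refresh a sample, then evict low-variance samples if over capacity."""
        self.samples[sample.task_id] = sample
        self._evict()

    def update_variance(self, task_id: str, variance: float) -> None:
        """Record the variance measured in the most recent training step."""
        if task_id in self.samples:
            self.samples[task_id].variance = variance

    def _evict(self) -> None:
        # Lowest variance goes first: such samples are either mastered or still out of reach.
        while len(self.samples) > self.capacity:
            worst = min(self.samples.values(), key=lambda s: s.variance)
            del self.samples[worst.task_id]

buffer = ReplayBuffer(capacity=2)
buffer.add(Sample("seed_a", variance=0.25))
buffer.add(Sample("seed_b", variance=0.02))
buffer.add(Sample("variant_of_seed_a", variance=0.20))  # stands in for a synthesized variant
kept = sorted(buffer.samples)
```

Here `seed_b` is evicted once the pool exceeds its capacity, which mirrors the idea of keeping the active pool at a roughly constant size while its contents change.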
Starting from 400 human seeds and maintaining an active training pool of ~800 samples, RODS achieves comparable performance to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and improves over fixed-data RL and environment augmentation in our controlled setting.