RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

arXiv cs.AI · Ruishan Fang, Siyuan Lu, Chenyi Zhuang, Tao Lin

6/18/2026

Quick Answer

This paper shows that RODS (Reward-driven Online Data Synthesis) addresses the depletion of informative samples in multi-turn tool-use reinforcement learning by synthesizing new data based on reward variance.

Quick Take

RODS reaches performance comparable to a 17K-sample offline pipeline with an active pool of only about 800 samples. It needs roughly 20x fewer trajectories, and its training data evolves dynamically with the policy.

Key Points

RODS synthesizes new samples using reward variance as a boundary detector.
It maintains an active training pool of approximately 800 samples.
RODS achieves similar performance to a 17K-sample offline pipeline.
Requires roughly 20x fewer trajectories than traditional methods.
Dynamic replay buffer evolves alongside the policy during training.
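The first key point can be illustrated with a small sketch: sample several rollouts per task, compute the variance of their rewards, and treat the highest-variance tasks as the ones closest to the capability boundary. This is a minimal illustration under stated assumptions, not the paper's implementation. The function names, the task ids, the binary rewards, and the simple ranking rule are all hypothetical.

```python
from statistics import pvariance


def reward_variance(rewards):
    """Population variance of the rewards from one task's rollouts."""
    return pvariance(rewards)


def rank_by_variance(rollout_rewards):
    """Return task ids sorted from highest to lowest reward variance."""
    return sorted(
        rollout_rewards,
        key=lambda task: reward_variance(rollout_rewards[task]),
        reverse=True,
    )


# Hypothetical rollout outcomes: 1.0 = success, 0.0 = failure.
rollout_rewards = {
    "always_solved": [1.0, 1.0, 1.0, 1.0],
    "never_solved": [0.0, 0.0, 0.0, 0.0],
    "boundary": [1.0, 0.0, 1.0, 0.0],
    "mostly_solved": [1.0, 1.0, 1.0, 0.0],
}

ranking = rank_by_variance(rollout_rewards)
```

In this toy data, the task with an even split of successes and failures ranks first, while tasks the policy always or never solves have zero variance and rank last. In the framing of the summary, those zero-variance tasks are the ones that stop being informative.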

Source Excerpt

Multi-turn RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in [...] concentrates on tasks with the highest rollout reward variance, a consequence of the Popoviciu upper bound. Consequently, samples near the agent's capability boundary -- where successes and failures are roughly balanced -- contribute disproportionately large policy gradients. As training progresses, this boundary continuously shifts, which gradually ...
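For reference, the Popoviciu bound named in the excerpt is a standard inequality on the variance of a bounded random variable. The worked case below assumes binary rewards in {0, 1} with per-task success probability p. That assumption is made here for illustration only; the excerpt does not state the reward range.

```latex
% Popoviciu's inequality for a reward R bounded in [m, M]:
\operatorname{Var}(R) \le \frac{(M - m)^2}{4}

% Assumed special case: binary reward R \in \{0, 1\} with success probability p
\operatorname{Var}(R) = p(1 - p) \le \frac{1}{4},
\qquad \text{with equality if and only if } p = \tfrac{1}{2}
```

Under that assumption, reward variance peaks exactly when successes and failures are balanced, which is consistent with the excerpt's description of the capability boundary.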

Read the full article on arxiv.org
