WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents
Quick Take
The WRIT pipeline synthesizes complex multi-turn training trajectories for user-facing agents, targeting robust, evidence-grounded decision-making under high information load. A 4B model trained on only 2K WRIT trajectories outperforms GPT-5.1 no-think on τ²-bench while substantially reducing inference-time token usage.
Key Points
- WRIT synthesizes write-intensive and read-heavy tasks for agent training.
- It diversifies user behavior to reflect realistic conversational variations.
- Training with WRIT improves decision-making under high information load.
- A 4B model trained on only 2K WRIT trajectories outperforms GPT-5.1 no-think on τ²-bench.
- The same model substantially reduces inference-time token usage.
Article Content
Source: arXiv:2606.02908v1 (announce type: new)
Abstract: Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and execute valid actions. A training trajectory records this process as an interleaved sequence of user messages, agent responses, tool calls, etc.
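As a rough illustration of what "an interleaved sequence of user messages, agent responses, tool calls" could look like as data, here is a minimal sketch. The class names (`Step`, `Trajectory`), the role labels, and the tool names (`get_reservation`, `update_reservation`) are hypothetical and chosen for illustration; the abstract does not specify a trajectory schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical role labels; the source does not define a schema.
ROLES = {"user", "agent", "tool_call", "tool_result"}


@dataclass
class Step:
    role: str                        # one of ROLES
    content: str                     # message text or serialized tool arguments/output
    tool_name: Optional[str] = None  # set only for tool_call / tool_result steps


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)

    def add(self, role: str, content: str, tool_name: Optional[str] = None) -> "Trajectory":
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.steps.append(Step(role, content, tool_name))
        return self  # allow chaining


# An invented example dialogue: one read call, then one write call.
traj = (
    Trajectory()
    .add("user", "I need to move my flight to Friday.")
    .add("agent", "Sure, can you give me your reservation ID?")
    .add("user", "It's ABC123.")
    .add("tool_call", '{"reservation_id": "ABC123"}', tool_name="get_reservation")
    .add("tool_result", '{"date": "Wednesday"}', tool_name="get_reservation")
    .add("tool_call", '{"reservation_id": "ABC123", "date": "Friday"}',
         tool_name="update_reservation")
)

roles = [s.role for s in traj.steps]
```

Keeping every event in one ordered list preserves the interleaving, which is the property the abstract emphasizes.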
Synthesizing sufficiently complex trajectories has become a central route to training agents: existing pipelines often increase difficulty by composing multiple user requests into longer tasks, producing write-intensive trajectories that train sequential execution. We argue that a single write decision can itself be difficult when the agent must gather and compare substantial read-tool evidence before its arguments become identifiable, a challenge that write-intensive data alone cannot address.
Guided by this insight, we propose WRIT (Write-Read Intensive Trajectory Synthesis), a pipeline for synthesizing multi-turn agent training trajectories along two complexity axes: the number of write decisions in a task and the evidence burden of each individual decision. WRIT first generates write-intensive and read-heavy tasks.
It then diversifies user behavior instructions to reflect realistic conversational variation, and finally simulates agent-user interactions in an executable environment to produce complete training trajectories. The resulting data trains agents not only for longer task execution, but also for robust, evidence-grounded decision making under high information load. With only 2K synthesized trajectories, a 4B model trained on WRIT outperforms GPT-5.1 no-think on τ²-bench and substantially reduces inference-time token usage, showing that compact SFT data can convert part of expensive test-time reasoning into efficient agent behavior.
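The two complexity axes named in the abstract (number of write decisions, and evidence burden per decision) can be made concrete with a small sketch. It assumes, purely for illustration, that evidence burden is approximated by the number of read-tool calls preceding each write; the abstract does not say how WRIT measures it, and the function name `complexity_axes` and the tool names are hypothetical.

```python
from typing import List, Tuple


def complexity_axes(tool_calls: List[Tuple[str, str]]) -> Tuple[int, List[int]]:
    """Return (number of writes, reads gathered before each write).

    tool_calls is an ordered list of (kind, tool_name) pairs, where kind is
    "read" or "write". Counting reads per write is an ASSUMED proxy for
    evidence burden, not the paper's definition.
    """
    reads_since_last_write = 0
    burden_per_write: List[int] = []
    for kind, _tool_name in tool_calls:
        if kind == "read":
            reads_since_last_write += 1
        elif kind == "write":
            burden_per_write.append(reads_since_last_write)
            reads_since_last_write = 0
        else:
            raise ValueError(f"unknown tool kind: {kind}")
    return len(burden_per_write), burden_per_write


# Write-intensive: several writes, each needing little evidence.
write_intensive = [
    ("read", "get_order"), ("write", "cancel_order"),
    ("read", "get_order"), ("write", "refund_order"),
    ("write", "send_receipt"),
]

# Read-heavy: a single write that depends on comparing many reads.
read_heavy = [
    ("read", "search_flights"), ("read", "get_fare_rules"),
    ("read", "get_user_profile"), ("read", "get_baggage_policy"),
    ("write", "book_flight"),
]

wi = complexity_axes(write_intensive)
rh = complexity_axes(read_heavy)
```

Under this proxy the first task scores high on the write axis and low on the evidence axis, and the second does the opposite, which mirrors the distinction the abstract draws between write-intensive and read-heavy tasks.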