CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward
Quick Answer
CacheRL trains small agent models that reach 92% process accuracy on multi-step tool-calling tasks, approaching GPT-5's 94% while requiring 100 times less compute.
Quick Take
CacheRL combines three innovations: a hybrid thinking trajectory pipeline, a three-tier fuzzy cache that replaces live tool execution during reinforcement learning, and a cache-tier-aware reward. With iterative SFT and GRPO, these raise Qwen3-4B-Thinking's validation reward from 0.43 to 0.78 and make the model competitive with frontier models such as GPT-5 on public agentic tool-calling benchmarks.
Key Points
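- Achieves 92% process accuracy on multi-step tool-calling tasks, close to GPT-5's 94%.
- Requires 100 times less compute than the GPT-5 baseline it is compared against.
- Augments agent trajectories with LLM-generated reasoning traces (hybrid thinking trajectories), so training examples teach why a tool is called, not only which one.
- Replaces live tool execution with a three-tier fuzzy cache, using token-level masking to preserve trajectory fidelity.
- In ablation studies, cache-aware rewards contribute a 17% improvement, and removing knowledge transfer reduces performance by 41%.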
Article Content
From source RSS / original summary
arXiv:2606.14179v1 Announce Type: new
Abstract: We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5's 94 percent while requiring 100 times less compute.
Our approach addresses three challenges in practical agent training: transferring tool-calling knowledge from large models at scale, enabling reinforcement learning without costly live tool execution, and learning robustly from noisy cached environments. CacheRL introduces three key innovations. First, a hybrid thinking trajectory pipeline augments agent trajectories with LLM-generated reasoning traces, producing training examples that teach models not only what tools to call but also why.
Second, the CacheAgentLoop eliminates live execution costs through a three-tier fuzzy cache while preserving trajectory fidelity using token-level masking. Third, a cache-tier-aware reward dynamically adjusts answer-quality weights to avoid penalizing models for cache-induced limitations. Through iterative supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), CacheRL improves Qwen3-4B-Thinking's validation reward from 0.43 to 0.78.
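The abstract does not say what the three cache tiers are or how the reward weights change, so the following is only a minimal sketch of the general idea, not the paper's CacheAgentLoop. The tier names (`exact`, `normalized`, `fuzzy`), the similarity threshold, the weight table, and the names `FuzzyToolCache` and `tier_aware_reward` are all assumptions made for illustration.

```python
import json
from difflib import SequenceMatcher

# Hypothetical tier names; the abstract does not define the three tiers.
EXACT, NORMALIZED, FUZZY, MISS = "exact", "normalized", "fuzzy", "miss"

# Assumed answer-quality weights: the weaker the cache match, the less the
# final answer counts toward the reward.
ANSWER_WEIGHT = {EXACT: 1.0, NORMALIZED: 0.8, FUZZY: 0.5, MISS: 0.0}


def _key(tool, args):
    """Canonical string for a tool call (sorted keys for stable comparison)."""
    return tool + "|" + json.dumps(args, sort_keys=True)


def _normalize(key):
    """Lowercase and collapse whitespace so trivially different calls match."""
    return " ".join(key.lower().split())


class FuzzyToolCache:
    def __init__(self, fuzzy_threshold=0.85):
        self.entries = {}  # canonical key -> recorded tool output
        self.fuzzy_threshold = fuzzy_threshold

    def put(self, tool, args, output):
        self.entries[_key(tool, args)] = output

    def lookup(self, tool, args):
        """Return (tier, output), trying the strictest tier first."""
        key = _key(tool, args)
        if key in self.entries:
            return EXACT, self.entries[key]
        norm = _normalize(key)
        for stored, output in self.entries.items():
            if _normalize(stored) == norm:
                return NORMALIZED, output
        best_score, best_output = 0.0, None
        for stored, output in self.entries.items():
            score = SequenceMatcher(None, norm, _normalize(stored)).ratio()
            if score > best_score:
                best_score, best_output = score, output
        if best_score >= self.fuzzy_threshold:
            return FUZZY, best_output
        return MISS, None


def tier_aware_reward(process_score, answer_score, tier):
    """Blend process and answer scores; the answer weight shrinks with the tier."""
    w = ANSWER_WEIGHT[tier]
    return (process_score + w * answer_score) / (1.0 + w)
```

Under these assumed weights, a rollout with a correct tool-calling process but a wrong final answer loses less reward when the observation came from a fuzzy match than from an exact one, which is one way to avoid penalizing the model for cache-induced limitations.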
On public agentic tool-calling benchmarks, our model achieves competitive performance against frontier models such as GPT-5. Ablation studies show that removing knowledge transfer reduces performance by 41 percent, while cache-aware rewards contribute a 17 percent improvement.
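The knowledge transfer measured in that ablation relies on trajectories augmented with reasoning traces. The abstract gives no data format, so the sketch below only illustrates the shape such a pipeline could take. The `ToolStep` fields, the tag strings, and the `explain` callable (a stand-in for an LLM call that writes a rationale) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToolStep:
    tool: str         # name of the tool called in the recorded trajectory
    arguments: str    # serialized arguments of the call
    observation: str  # tool output recorded in the trajectory


def build_training_example(
    task: str,
    steps: List[ToolStep],
    explain: Callable[[str, List[ToolStep], ToolStep], str],
) -> str:
    """Insert a reasoning trace before each tool call.

    `explain` receives the task, the steps taken so far, and the next step,
    and returns a short rationale for that next step.
    """
    parts = [f"Task: {task}"]
    for i, step in enumerate(steps):
        rationale = explain(task, steps[:i], step)
        # Tag names are illustrative, not the paper's format.
        parts.append(f"<think>{rationale}</think>")
        parts.append(f"<tool_call>{step.tool}({step.arguments})</tool_call>")
        parts.append(f"<tool_response>{step.observation}</tool_response>")
    return "\n".join(parts)
```

In practice `explain` would wrap a call to a large teacher model; passing it in as a parameter keeps the sketch runnable with a stub.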
Interestingly, reinforcement learning improves training stability but yields limited gains beyond strong supervised fine-tuning, suggesting that data quality and reward design play a more important role than complex optimization methods in building practical small agent models.
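For readers unfamiliar with GRPO, its core step is to score each rollout relative to the other rollouts sampled for the same prompt, instead of using a learned value function. The sketch below shows only that normalization step. The function name, the `eps` constant, and the choice of population standard deviation are assumptions, and implementations differ on these details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

One property of this normalization is that when every rollout in a group receives the same reward, all advantages are zero and that group contributes no policy-gradient signal.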