ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

arXiv cs.AI·Anjie Liu, Yan Song, Zhixun Chen, Ziqin Gong, Zhongwei Yu, Jun Wang

6/3/2026

Quick Take

ToolGate is a lightweight external controller for tool-augmented vision-language agents that decides whether each proposed perceptual tool call should be executed or skipped. Across two Qwen3-VL backbones, it reduces token cost to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings. With matched-domain trajectory training on Qwen3-VL-30B, it further improves average accuracy by 1.65 points.

Key Points

ToolGate predicts execute/skip decisions from trajectory text and simple structural features.
The baseline agent shows poor local selectivity: helpful and harmful calls occur at similar rates (11.8% vs. 9.9%), and most calls do not change the immediate forced-answer prediction.
Token cost falls to 64-69% of the unrestricted ReAct baseline while average accuracy is preserved in cross-domain settings.
Matched-domain trajectory training on Qwen3-VL-30B further improves average accuracy by 1.65 points.

Article Content

From source RSS / original summary

arXiv:2606.03054v1. Abstract: Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call control problem: after a ReAct-style VLM agent proposes a perceptual tool call, should the call be executed, or skipped before its output enters the context?
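The pre-call control decision can be pictured as a small wrapper around one agent step. The following is a minimal sketch, not the paper's implementation: `ToolCall`, `Trajectory`, `run_step`, and the `gate` and `execute_tool` callables are all hypothetical names introduced here for illustration, and appending nothing to the context on a skip is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToolCall:
    tool: str       # e.g. "ocr"; tool names are placeholders
    argument: str   # whatever the agent passes to the tool


@dataclass
class Trajectory:
    # Text the agent sees as context (question, thoughts, observations).
    steps: List[str] = field(default_factory=list)


def run_step(
    trajectory: Trajectory,
    proposed: ToolCall,
    gate: Callable[[Trajectory, ToolCall], bool],
    execute_tool: Callable[[ToolCall], str],
) -> bool:
    """Apply pre-call control to one proposed tool call.

    Returns True if the call was executed, False if it was skipped.
    """
    if gate(trajectory, proposed):
        # Execute: the tool output enters the context and costs tokens.
        output = execute_tool(proposed)
        trajectory.steps.append(f"Observation[{proposed.tool}]: {output}")
        return True
    # Skip: the tool is never run, so nothing is added to the context
    # (an assumption of this sketch).
    return False
```

The gate is passed in as a plain callable so that any execute/skip predictor, learned or rule-based, can be swapped in without touching the agent loop.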

Across five benchmarks, we find that the baseline agent exhibits poor local selectivity: helpful and harmful calls occur at similar rates (11.8% vs. 9.9%), while most calls do not change the immediate forced-answer prediction. We introduce ToolGate, a lightweight external controller that predicts execute/skip decisions from trajectory text and simple structural features.
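One way to read the local selectivity measurement is to compare the forced-answer prediction immediately before and after a tool call. The sketch below assumes a labeling rule that the abstract does not spell out: a call is "helpful" if it turns a wrong answer into a correct one, "harmful" if it does the reverse, and "neutral" otherwise. The function names and the exact-match comparison are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def label_call(pred_before: str, pred_after: str, gold: str) -> str:
    """Label one tool call by its effect on the forced-answer prediction."""
    before_ok = pred_before == gold
    after_ok = pred_after == gold
    if not before_ok and after_ok:
        return "helpful"
    if before_ok and not after_ok:
        return "harmful"
    # Correctness unchanged (this also covers wrong-to-wrong answer changes).
    return "neutral"


def selectivity_rates(
    records: Iterable[Tuple[str, str, str]],
) -> Dict[str, float]:
    """Fraction of calls in each category.

    Each record is (pred_before, pred_after, gold).
    """
    counts = Counter(label_call(*record) for record in records)
    total = sum(counts.values())
    labels = ("helpful", "harmful", "neutral")
    if total == 0:
        return {name: 0.0 for name in labels}
    return {name: counts[name] / total for name in labels}
```

Under this reading, similar helpful and harmful rates mean that executing every proposed call adds cost without a clear net gain in correctness, which is the gap a gate is meant to close.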

Across two Qwen3-VL backbones, ToolGate reduces token cost to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings. With matched-domain trajectory training on Qwen3-VL-30B, it further improves average accuracy by 1.65 points. These results show that tool-augmented VLM agents benefit not only from better perceptual tools, but also from explicit control over when tool outputs are worth paying for.
