ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

arXiv cs.AI·Ashutosh Hathidara, Sai Shruthi Sistla, Sebastian Schreiber, Sahil Bansal

6/12/2026

Quick Take

ToolSense is an open-source diagnostic framework that evaluates parametric tool retrieval in LLMs. On realistic queries, several model configurations drop by ~50-64 percentage points relative to standard ToolBench benchmarks. Some models also score near-random on factual probes despite strong retrieval scores, which suggests a knowledge-retrieval dissociation. The framework is available at https://github.com/SAP/toolsense.

Key Points

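ToolSense generates three benchmarks from any tool catalog: a Realistic Retrieval Benchmark (RRB), an MCQ probing benchmark, and a QA probing benchmark.
On RRB queries, several configurations dropped by ~50-64 percentage points relative to fully specified ToolBench queries, falling below the embedding-model baseline.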
Five parametric model configurations were tested against ~47k tools in ToolBench.
Despite high retrieval scores, some models scored near-random on factual probes.
ToolSense framework and benchmarks are open-sourced for public use.

Article Content

arXiv:2606.12451v1. Abstract: Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck.

Embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics. Parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary and fine-tuning the model in two stages (memorization, then retrieval SFT) so that the LLM itself acts as the retriever. This approach achieves strong performance on standard ToolBench retrieval benchmarks.
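As a rough illustration of "one virtual token per tool", the sketch below extends a toy vocabulary with a new token id for each tool. It is not the paper's implementation: the `<tool:...>` naming scheme, the `extend_vocab` helper, and the tiny vocabulary are all assumptions made for the example, and the two-stage fine-tuning is not shown.

```python
# Toy illustration only: every name here is hypothetical, not from ToolSense.

def extend_vocab(vocab: dict[str, int], tool_names: list[str]) -> dict[str, int]:
    """Return a copy of vocab with one new token id appended per tool."""
    extended = dict(vocab)  # leave the caller's vocabulary untouched
    next_id = max(extended.values()) + 1 if extended else 0
    for name in tool_names:
        token = f"<tool:{name}>"  # assumed naming scheme for virtual tokens
        if token not in extended:
            extended[token] = next_id
            next_id += 1
    return extended


# A tiny stand-in for an LLM vocabulary and a two-tool catalog.
base_vocab = {"<pad>": 0, "find": 1, "weather": 2, "flights": 3}
tools = ["weather_api", "flight_search"]

vocab = extend_vocab(base_vocab, tools)

# Retrieval then amounts to predicting one of these ids as the next token.
tool_token_ids = {vocab[f"<tool:{t}>"] for t in tools}
```

In a real model, each new id would also need a trainable embedding row, which is what the memorization stage would presumably fit.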

Yet these benchmarks use verbose, fully specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither reveals whether the model actually understands its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark.
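The point about constrained decoding can be made concrete with a small sketch. It assumes the simplest case, where each tool is a single token, so "valid token paths" reduces to a set of allowed token ids. The scores and ids are made up for illustration and do not come from the paper.

```python
import math


def argmax(scores: dict[int, float]) -> int:
    """Return the token id with the highest score."""
    return max(scores, key=scores.get)


def constrained_argmax(scores: dict[int, float], valid_ids: set[int]) -> int:
    """Mask every id outside valid_ids to -inf, then take the argmax."""
    masked = {i: (s if i in valid_ids else -math.inf) for i, s in scores.items()}
    return argmax(masked)


# Hypothetical next-token scores: ids 0-3 are ordinary tokens, 4-5 are tool tokens.
scores = {0: 0.1, 1: 2.5, 2: 1.9, 3: 0.3, 4: 0.2, 5: 0.4}
tool_ids = {4, 5}

free_choice = argmax(scores)                          # model's unconstrained preference
forced_choice = constrained_argmax(scores, tool_ids)  # best-scoring tool token only
```

Here the unconstrained choice is an ordinary token, yet the constrained choice is always a well-formed tool id. An evaluation that only looks at the constrained output therefore cannot tell how little probability mass the model placed on tools in the first place.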

Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations yields two findings. On RRB queries, several configurations collapse by ~50-64 percentage points compared to fully specified ToolBench benchmarks, falling below the embedding-model baseline. In addition, some models score near-random on factual probes despite strong retrieval performance, suggesting a knowledge-retrieval dissociation.
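To make the units in these findings explicit, the sketch below separates a percentage-point drop from a relative drop and computes the chance level for a multiple-choice probe. The scores are hypothetical placeholders, not results from the paper, and the four-option MCQ format is an assumption.

```python
def percentage_point_drop(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return before_pct - after_pct


def relative_drop(before_pct: float, after_pct: float) -> float:
    """Drop expressed as a percentage of the starting score."""
    return (before_pct - after_pct) / before_pct * 100


def mcq_chance_accuracy(num_options: int) -> float:
    """Expected accuracy (in percent) of uniform random guessing."""
    return 100.0 / num_options


# Hypothetical scores, chosen only to illustrate the arithmetic.
standard_score = 90.0   # score on fully specified queries
realistic_score = 30.0  # score on realistic queries

pp = percentage_point_drop(standard_score, realistic_score)  # 60.0 points
rel = relative_drop(standard_score, realistic_score)         # about 66.7 percent
chance = mcq_chance_accuracy(4)                              # 25.0 for 4 options
```

A "near-random" MCQ score is one close to `chance`, so interpreting that finding requires knowing how many answer options each probe offers.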

We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.
