ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. Embedding-based retrieval relies on compact encoders that may under-capture specialized tool semantics. Parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary and fine-tuning the model in two stages (memorization, then retrieval SFT) so that the LLM itself acts as the retriever; this approach achieves strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither setting reveals whether the model actually understands its tools. We introduce \textbf{ToolSense}, an open-source, LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, a multiple-choice (MCQ) probing benchmark, and a question-answering (QA) probing benchmark. We apply ToolSense to ToolBench (~47k tools) and evaluate five parametric model training configurations. On RRB queries, several configurations drop by ~50-64 percentage points relative to the fully specified ToolBench benchmarks, falling below the embedding-model baseline. In addition, some models score near random on factual probes despite strong retrieval performance. Together, these results suggest a knowledge-retrieval dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.
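
To illustrate why constrained decoding can mask a lack of tool understanding, the following is a minimal sketch, not the paper's evaluation code. It assumes each tool is a single virtual token; the token names, the scores, and both helper functions are hypothetical and chosen only for illustration.

```python
import math

def unconstrained_argmax(logits):
    """Pick the highest-scoring token over the full vocabulary."""
    return max(logits, key=logits.get)

def constrained_argmax(logits, allowed):
    """Pick the highest-scoring token among `allowed`, masking everything else."""
    masked = {
        tok: (score if tok in allowed else -math.inf)
        for tok, score in logits.items()
    }
    return max(masked, key=masked.get)

# Hypothetical next-token scores for one query (invented values).
# Ordinary word tokens outscore every virtual tool token here.
logits = {
    "the": 4.1,
    "I": 3.7,
    "<tool_weather>": 1.2,
    "<tool_stocks>": 0.9,
    "<tool_maps>": 0.3,
}

# Assumption: virtual tool tokens are identifiable by a "<tool_" prefix.
tool_tokens = {tok for tok in logits if tok.startswith("<tool_")}

free_choice = unconstrained_argmax(logits)               # not a tool token
forced_choice = constrained_argmax(logits, tool_tokens)  # always a tool token

print("unconstrained:", free_choice)
print("constrained:  ", forced_choice)
```

In this toy case the constrained decoder always returns some valid tool, even though the model's unconstrained preference is not a tool token at all, which is one way a retrieval score under constrained decoding could overstate what the model knows.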
