Tool-augmented large language model (LLM) agents increasingly operate over large tool libraries, but existing evaluations often focus on whether a model can call a tool correctly rather than on how the visible tool menu shapes reliability, efficiency, and safety-relevant risk exposure. We introduce ToolMenuBench, a benchmark for evaluating tool-menu construction in multi-step LLM agents. ToolMenuBench varies tool-menu size, distractor type, state-dependent task structure, and risk exposure, and reports both filter-level and downstream agent metrics: visible-tool count, risky-tool exposure, task success, wrong-tool calls, premature actions, and token usage. In a controlled evaluation across seven model backends, three tool-menu sizes, six filtering methods, and seven evaluation settings, causal minimal tool filtering (CMTF) improves task success from 32.1% under all-tools exposure to 85.7% while reducing average token usage by roughly 98%. CMTF achieves the strongest overall tradeoff, reducing visible tools, wrong-tool calls, premature actions, and risky-tool exposure relative to unfiltered exposure, lexical filtering, state-aware filtering, and broader causal-path baselines. ToolMenuBench provides a reusable evaluation framework for studying the agent-interface problem: which tools should be visible, when they should be visible, and under what cost or risk constraints.