Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general "search" APIs) are often opaque, lacking clear best practices or documented failure modes. Can LLM agents improve their performance in environments with opaque tools by interacting with them and subsequently improving their documentation? To study this, we create OpaqueToolsBench, a benchmark consisting of three distinct task-oriented environments: general function calling, interactive chess playing, and long-trajectory agentic search. Each environment provides underspecified tools that models must learn to use effectively to complete the task. Results on OpaqueToolsBench suggest that existing methods for automatically documenting tools are expensive and unreliable when tools are opaque. To address this, we propose a simple framework, ToolObserver, which iteratively refines tool documentation by observing execution feedback from tool-calling trajectories. Our approach outperforms existing methods across the datasets in OpaqueToolsBench, even in relatively hard settings. Furthermore, in test-time tool-exploration settings, our method is also efficient, consuming 3.5-7.5x fewer total tokens than the best baseline.
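The core idea of refining documentation from execution feedback can be illustrated with a minimal toy sketch. Everything below is an illustrative assumption, not the paper's actual implementation: a toy "search" tool has an undocumented constraint (queries must be lowercase), and the loop folds the observed failure message back into the tool's docs until calls succeed.

```python
def search(query: str) -> str:
    """Opaque tool: undocumented constraint that queries must be lowercase."""
    if query != query.lower():
        raise ValueError("query must be lowercase")
    return f"results for {query!r}"

def attempt(query: str, docs: str) -> tuple[bool, str]:
    """One tool-calling attempt; a real agent would condition an LLM on `docs`."""
    # Apply any learned usage note before calling the tool.
    if "lowercase" in docs:
        query = query.lower()
    try:
        return True, search(query)
    except ValueError as err:
        return False, str(err)

def refine_docs(query: str, docs: str, max_rounds: int = 3) -> str:
    """Iteratively update docs from execution feedback until the call succeeds."""
    for _ in range(max_rounds):
        ok, feedback = attempt(query, docs)
        if ok:
            break
        # Fold the observed failure mode back into the documentation.
        docs += f"\nNote: {feedback}."
    return docs

docs = refine_docs("OpaqueToolsBench", docs="search(query): general search API.")
print(docs)
```

In this toy version the "refinement" is just appending the raw error string; the framework described above would instead use an LLM to rewrite the documentation from full tool-calling trajectories.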