Large language models (LLMs) increasingly rely on external tools and APIs to execute complex tasks specified in natural language. Evaluating such tool-calling capabilities in realistic enterprise settings is challenging: APIs are often proprietary, heterogeneous, and difficult to share, limiting reproducible benchmarks. To address this, we introduce Live API Bench, a comprehensive benchmark constructed by transforming NL2SQL datasets into interactive API environments. Our pipeline converts SQL queries from BIRD SQL into executable API sequences across three formulations (SLOT, SEL, and REST), covering minimal general-purpose operations, domain-specific multi-step tasks, and function-oriented RESTful interactions, respectively. The benchmark spans 11 databases with over 2,500 invocable tools, paired with human-authored queries, ground-truth API sequences, and verified final answers. Live API Bench enables systematic evaluation of core challenges in tool use, including error handling, sequential reasoning, parameter generation, response parsing, and robustness across diverse domains. We evaluate 10 LLMs and 4 ReAct agents, observing low task completion rates (7% to 47%), which improve only modestly to 50% under interactive agent settings, highlighting substantial scope for improving LLM tool-calling performance. We release all code and data associated with this paper.