Recent advances in integrating large language models (LLMs) with application programming interfaces (APIs) have attracted significant interest in both academia and industry. These API-based agents, leveraging the strong autonomy and planning capabilities of LLMs, can efficiently solve problems that require multi-step actions. However, their ability to handle tasks of varying difficulty, diverse types, and real-world demands through APIs remains unknown. In this paper, we introduce \textsc{ShortcutsBench}, a large-scale benchmark for the comprehensive evaluation of API-based agents on tasks with varying difficulty levels, diverse task types, and real-world demands. \textsc{ShortcutsBench} includes a wealth of real APIs from Apple Inc.'s operating systems, refined user queries derived from shortcuts, high-quality action sequences annotated by shortcut developers, and accurate parameter-filling values covering primitive parameter types, enum parameter types, outputs of previous actions, and parameters that require requesting necessary information from the system or the user. Our extensive evaluation of agents built with $5$ leading open-source LLMs (size $\geq 57$B) and $4$ closed-source LLMs (e.g., Gemini-1.5-Pro and GPT-3.5) reveals significant limitations in handling complex queries related to API selection, parameter filling, and requesting necessary information from systems and users. These findings highlight the challenges API-based agents face in effectively fulfilling real and complex user queries. All datasets, code, and experimental results will be available at \url{https://github.com/eachsheep/shortcutsbench}.