Significant focus has been placed on integrating large language models (LLMs) with various tools to develop general-purpose agents, which places high demands on LLMs' tool-use capabilities. However, evident gaps remain between existing tool-use evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only interactions, and thus fail to effectively reveal agents' real-world problem-solving abilities. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool use, requiring the LLM to reason about the suitable tools and plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task-execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as query contexts to align closely with real-world scenarios. We design 229 real-world tasks and executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs: GPT-4 completes less than 50% of the tasks, and most LLMs achieve below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios and points to future directions for advancing general-purpose tool agents. The code and dataset are available at https://github.com/open-compass/GTA.