Significant focus has been placed on integrating large language models (LLMs) with various tools to develop general-purpose agents, which places high demands on LLMs' tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only interactions, and thus fail to effectively reveal agents' real-world problem-solving abilities. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool-use requirements, which compel the LLM to reason about suitable tools and plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task-execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as query contexts to align closely with real-world scenarios. We design 229 real-world tasks with executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs: GPT-4 completes less than 50% of the tasks, and most LLMs achieve below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios and provides directions for advancing general-purpose tool agents. The code and dataset are available at https://github.com/open-compass/GTA.