Large Language Models (LLMs) are becoming increasingly capable and autonomous, targeting practical real-world missions beyond traditional NLP tasks. As a result, there is an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional, evolving benchmark that currently consists of 8 distinct environments for assessing the reasoning and decision-making abilities of LLMs as agents in a multi-turn, open-ended generation setting. Our extensive tests of 27 API-based and open-source (OSS) LLMs show that, while top commercial LLMs demonstrate a strong ability to act as agents in complex environments, there is a significant performance gap between them and their OSS competitors. We identify the typical causes of failure across environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction-following abilities are the main obstacles to developing usable LLM agents. Training on code and high-quality multi-turn alignment data could improve agent performance. Datasets, environments, and an integrated evaluation package for AgentBench are released at \url{https://github.com/THUDM/AgentBench}.