Large Language Models (LLMs) are becoming increasingly smart and autonomous, and are now applied to practical real-world tasks beyond traditional NLP. As a result, there is an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional, evolving benchmark that currently consists of 8 distinct environments for assessing LLM-as-Agent's reasoning and decision-making abilities in a multi-turn, open-ended generation setting. Our extensive tests of 25 LLMs (including API-based and open-source models) show that, while top commercial LLMs demonstrate a strong ability to act as agents in complex environments, there is a significant performance gap between them and their open-source competitors. AgentBench also serves as a component of an ongoing project with wider coverage and deeper consideration towards systematic LLM evaluation. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench