LLM-based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methods for these increasingly capable agents. We analyze the field of agent evaluation from five perspectives: (1) core LLM capabilities needed for agentic workflows, such as planning and tool use; (2) application-specific benchmarks, such as those for web and software engineering (SWE) agents; (3) evaluation of generalist agents; (4) analysis of the core dimensions of agent benchmarks; and (5) evaluation frameworks and tools for agent developers. Our analysis reveals current trends, including a shift toward more realistic and challenging evaluations built on continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained, scalable evaluation methods.