Evaluating large language models (LLMs) as general-purpose agents is essential for understanding their capabilities and for integrating them into practical applications. The evaluation process, however, presents substantial challenges. A primary obstacle is benchmarking agent performance across diverse scenarios within a unified framework, particularly when environments must remain partially observable and interactions must span multiple rounds. Moreover, current evaluation frameworks mostly focus on the final success rate, which reveals little about the intermediate process and fails to provide a deep understanding of model abilities. To address these challenges, we introduce AgentBoard, a pioneering, comprehensive benchmark and accompanying open-source evaluation framework tailored to the analytical evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric that captures incremental advancements, as well as a comprehensive evaluation toolkit that supports multi-faceted analysis of agents through interactive visualization. This not only sheds light on the capabilities and limitations of LLM agents but also brings the interpretability of their performance to the forefront. Ultimately, AgentBoard serves as a significant step towards demystifying agent behaviors and accelerating the development of stronger LLM agents.
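To illustrate why a progress rate can be more informative than a final success rate, the following is a minimal sketch and not AgentBoard's actual implementation. It assumes, purely for illustration, that a task is annotated with a set of named subgoals and that each step of a trajectory reports which subgoals are satisfied; the function names `progress_rate` and `success` and the subgoal labels are hypothetical.

```python
from typing import Sequence, Set


def progress_rate(achieved_per_step: Sequence[Set[str]], subgoals: Set[str]) -> float:
    """Fraction of annotated subgoals reached at any step of the trajectory.

    Hypothetical definition for illustration; the benchmark's own metric
    may be defined differently.
    """
    if not subgoals:
        raise ValueError("at least one subgoal is required")
    reached: Set[str] = set()
    for achieved in achieved_per_step:
        # Count only subgoals that belong to the task annotation.
        reached |= achieved & subgoals
    return len(reached) / len(subgoals)


def success(achieved_per_step: Sequence[Set[str]], subgoals: Set[str]) -> bool:
    """Binary success: every subgoal was reached at some point."""
    return progress_rate(achieved_per_step, subgoals) == 1.0


# Hypothetical task with four subgoals; the agent reaches two of them.
subgoals = {"find_key", "open_door", "take_item", "exit_room"}
trajectory = [{"find_key"}, {"find_key", "open_door"}, set()]

rate = progress_rate(trajectory, subgoals)
done = success(trajectory, subgoals)
print(f"progress rate = {rate:.2f}, success = {done}")
```

Under these assumptions, a success-only metric scores this trajectory the same as one that achieves nothing, whereas the progress rate records that half of the subgoals were reached.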