AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

Xiaoyuan Liu, Jianhong Tu, Yuqi Chen, Siyuan Xie, Sihan Ren, Tianneng Shi, Gal Gantar, Evan Sandoval, Donghyun Lee, Daniel Miao, Peter J. Gilbert, Nick Hynes, Mauro Staver, Warren He, David Marn, Andrew Low, Xi Zhang, Elron Bandel, Michal Shmueli-Scheuer, Siva Reddy, Alexandre Drouin, Alexandre Lacoste, Ramayya Krishnan, Elham Tabassi, Yu Su, Victor Barres, Chenguang Wang, Wenbo Guo, Dawn Song

Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.
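
The single-interface idea can be illustrated with a toy sketch. It is not the AgentBeats implementation and does not model real A2A or MCP messages: the names `SubjectAgent`, `handle_task`, `Task`, `JudgeAgent`, `UpperAgent`, and `ReverseAgent` are hypothetical, and a plain method call stands in for the standardized protocol. What it shows is that the judge holds the assessment logic and depends only on one message-passing interface, so differently built agents can be scored without any per-agent integration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Protocol


class SubjectAgent(Protocol):
    """Hypothetical single interface exposed by every assessed agent."""

    def handle_task(self, message: str) -> str: ...


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # scoring rule kept on the judge side


class JudgeAgent:
    """Owns the assessment logic; knows nothing about agent internals."""

    def __init__(self, tasks: list[Task]) -> None:
        self.tasks = tasks

    def assess(self, subject: SubjectAgent) -> float:
        # Send each task through the shared interface and score the reply.
        passed = sum(
            1 for task in self.tasks if task.check(subject.handle_task(task.prompt))
        )
        return passed / len(self.tasks)


class UpperAgent:
    def handle_task(self, message: str) -> str:
        return message.upper()


class ReverseAgent:
    def handle_task(self, message: str) -> str:
        return message[::-1]


# Two illustrative tasks; the checks are assumptions made for this sketch.
tasks = [
    Task("abc", lambda reply: reply == "ABC"),
    Task("noon", lambda reply: reply.lower() == "noon"),
]
judge = JudgeAgent(tasks)
scores = {
    "upper": judge.assess(UpperAgent()),
    "reverse": judge.assess(ReverseAgent()),
}
```

Both subject agents are assessed by the same judge with no changes to the judge's code. In this toy setting, that is the separation of assessment logic from agent implementation that the abstract describes.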
