AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

Xiaoyuan Liu, Jianhong Tu, Yuqi Chen, Siyuan Xie, Sihan Ren, Tianneng Shi, Gal Gantar, Evan Sandoval, Donghyun Lee, Daniel Miao, Peter J. Gilbert, Nick Hynes, Mauro Staver, Warren He, David Marn, Andrew Low, Xi Zhang, Elron Bandel, Michal Shmueli-Scheuer, Siva Reddy, Alexandre Drouin, Alexandre Lacoste, Ramayya Krishnan, Elham Tabassi, Yu Su, Victor Barres, Chenguang Wang, Wenbo Guo, Dawn Song

Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.
