Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.
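The single-interface idea can be illustrated with a minimal in-process sketch. It is not the AgentBeats implementation and does not use A2A or MCP; in a real deployment the `handle_task` call would presumably be a task sent over A2A, with tools exposed through MCP. All names here (`SubjectAgent`, `ExactMatchJudge`, `handle_task`, `pass_rate`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


class SubjectAgent(Protocol):
    """Hypothetical single interface that every assessed agent exposes."""

    def handle_task(self, task: str) -> str: ...


@dataclass
class Assessment:
    task: str
    response: str
    passed: bool


class ExactMatchJudge:
    """Toy judge agent: it owns the tasks and the scoring logic."""

    def __init__(self, cases: dict[str, str]) -> None:
        self.cases = cases  # task text -> expected answer

    def assess(self, agent: SubjectAgent) -> list[Assessment]:
        # The judge only ever calls the shared interface, so it never
        # needs to know how the subject agent is implemented.
        results = []
        for task, expected in self.cases.items():
            response = agent.handle_task(task)
            passed = response.strip() == expected
            results.append(Assessment(task, response, passed))
        return results


class EchoAgent:
    """Subject agent that returns the task text unchanged."""

    def handle_task(self, task: str) -> str:
        return task


class UpperAgent:
    """Subject agent that upper-cases the task text."""

    def handle_task(self, task: str) -> str:
        return task.upper()


def pass_rate(results: list[Assessment]) -> float:
    """Fraction of tasks the subject agent passed."""
    return sum(r.passed for r in results) / len(results)


# One judge, two unrelated agent implementations, no per-agent harness.
judge = ExactMatchJudge({"ping": "ping", "OK": "OK"})
```

Under these assumptions, `judge.assess(EchoAgent())` and `judge.assess(UpperAgent())` run the same assessment logic against two different agent designs, which is the separation of assessment logic from agent implementation that AAA argues for.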