Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), in which evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, whereas AAA needs only one. The result is a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, multi-agent evaluation. We further introduce AgentBeats, a concrete realization of AAA, and identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies. The first is a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks. The second is a case study on coding agents, which confirms that agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining the community-scale field study with the controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.