Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate into reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts, including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric, which measures whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65% and achieving a 79.5% human expert preference rate over baseline approaches. Ablation studies further show that evaluation skills are critical for handling complex evaluations: removing them causes Eval@1 to drop from 65% to 30%.
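As a rough illustration of the Eval@1 idea described in the abstract, the sketch below computes the fraction of generated evaluations that both execute and yield meaningful results on their first run. The record type `FirstRunResult`, its fields, and the function name `eval_at_1` are hypothetical names introduced here, and the sample data is invented for illustration. How "meaningful" is judged is left as a boolean input, since the abstract does not specify the criterion.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FirstRunResult:
    """Outcome of the first run of one generated evaluation (hypothetical schema)."""
    agent_id: str
    executed: bool     # the evaluation code ran to completion on the first attempt
    meaningful: bool   # the output was judged meaningful (criterion assumed external)


def eval_at_1(results: Sequence[FirstRunResult]) -> float:
    """Fraction of first runs that both executed and produced meaningful results."""
    if not results:
        raise ValueError("eval_at_1 needs at least one result")
    passed = sum(1 for r in results if r.executed and r.meaningful)
    return passed / len(results)


# Invented sample data: two of four first runs satisfy both conditions.
sample = [
    FirstRunResult("agent_a", executed=True, meaningful=True),
    FirstRunResult("agent_b", executed=True, meaningful=False),
    FirstRunResult("agent_c", executed=False, meaningful=False),
    FirstRunResult("agent_d", executed=True, meaningful=True),
]
score = eval_at_1(sample)  # 0.5
```

Requiring both conditions at once is the point of the sketch: a run that executes but returns uninformative results (as with `agent_b`) does not count toward the score.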