Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We test this assumption by collecting 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points even at temperature 0. This variance has critical implications: reported improvements of 2--3 percentage points may reflect evaluation noise rather than genuine algorithmic progress. Through token-level analysis, we show that trajectories diverge early, often within the first few percent of tokens, and that these small differences cascade into different solution strategies. To enable reliable evaluation of agentic systems, we recommend three concrete practices: (1) estimate pass@1 from multiple independent runs per task, especially when measuring small improvements, (2) use statistical power analysis to determine the number of runs needed to detect expected effect sizes, and (3) consider metrics like pass@k (optimistic bound) and pass^k (pessimistic bound) with k>1 to better characterize the full performance envelope. While these practices increase evaluation cost, they are essential for distinguishing genuine scientific progress from statistical noise.
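The multi-run metrics in recommendation (3) can be sketched as follows. This is a minimal illustration, assuming per-task outcomes are recorded as c correct results out of n independent runs; the function names are ours, with pass@k computed via the standard unbiased combinatorial estimator and pass^k as the probability that all k sampled runs succeed.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k (optimistic bound): probability that at least one of k runs,
    sampled without replacement from n runs with c successes, is correct.
    Unbiased estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some sample must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """pass^k (pessimistic bound): probability that all k sampled runs succeed.
    Estimator: C(c, k) / C(n, k)."""
    if c < k:
        return 0.0  # not enough successes for an all-success sample
    return comb(c, k) / comb(n, k)

# Example: 10 runs on a task, 6 resolved.
print(pass_at_k(10, 6, 1))  # 0.6 -- multi-run pass@1 estimate
print(pass_at_k(10, 6, 3))  # ~0.967 -- at least one of 3 runs succeeds
print(pass_hat_k(10, 6, 3))  # ~0.167 -- all 3 runs succeed
```

Note that pass@1 here is simply c/n averaged over tasks, so the same per-task run counts support all three metrics at no extra cost beyond the runs themselves.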