AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-only view treats a principled solution and a chaotic trial-and-error process as equivalent. We show that this equivalence is empirically false. We evaluate 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Of these, 47 have enough passing trajectories to construct task-level process references, yielding a 1,815-trajectory evaluation subset. Among passing trajectories in this subset, 10.7% exhibit behavior we call a Lucky Pass: regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification. We introduce AgentLens, a framework for process-level assessment of SWE-agent trajectories, and define AgentLens-Bench, a dataset of 1,815 trajectories annotated with quality scores, waste signals, divergence points, and 47 task-level Prefix Tree Acceptor (PTA) references. AgentLens builds PTA references by merging multiple passing solutions for the same task, and uses a context-sensitive intent labeler to assign actions to Exploration, Implementation, Verification, or Orchestration based on trajectory history rather than tool identity alone. On AgentLens-Bench, the quality score separates passing trajectories into Lucky, Solid, and Ideal tiers and further decomposes Lucky Passes into five recurring mechanisms. Across the eight model backends, Lucky rates range from 0.5% to 23.2%, and some models move by as many as five rank positions when ranked by quality score instead of pass rate. We plan to release the project repository soon, including AgentLens-Bench artifacts, the AgentLens SDK, and the analysis tooling.

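To make the PTA reference idea concrete, the following is a minimal sketch of a Prefix Tree Acceptor built over intent-label sequences. It is an illustration under stated assumptions, not the AgentLens implementation: the names `PTANode`, `build_pta`, and `divergence_index` are hypothetical, trajectories are reduced to plain lists of intent labels, and the divergence point is taken to be the first action that leaves the tree, which the abstract does not specify.

```python
from typing import Dict, Optional, Sequence


class PTANode:
    """One state of the prefix tree; edges are keyed by intent label."""

    def __init__(self) -> None:
        self.children: Dict[str, "PTANode"] = {}
        self.accepting = False  # True if a passing trajectory ends here


def build_pta(passing: Sequence[Sequence[str]]) -> PTANode:
    """Merge several passing label sequences into a single prefix tree."""
    root = PTANode()
    for seq in passing:
        node = root
        for label in seq:
            # Shared prefixes reuse existing nodes; new suffixes add branches.
            node = node.children.setdefault(label, PTANode())
        node.accepting = True
    return root


def divergence_index(root: PTANode, seq: Sequence[str]) -> Optional[int]:
    """Return the index of the first action not covered by the tree.

    Returns None when every action follows an existing path
    (assumed definition of "no divergence" for this sketch).
    """
    node = root
    for i, label in enumerate(seq):
        nxt = node.children.get(label)
        if nxt is None:
            return i
        node = nxt
    return None


# Hypothetical label sequences for one task (illustrative data only).
E, I, V = "Exploration", "Implementation", "Verification"
reference = build_pta([[E, I, V], [E, E, I, V]])

print(divergence_index(reference, [E, I, V]))     # None
print(divergence_index(reference, [E, I, I, V]))  # 2
```

In this toy setup, a trajectory that repeats Implementation before any Verification leaves the reference at its third action, which is the kind of position a divergence-point annotation could record.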