Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from independent agent trajectories that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along coverage, ranking, and context-efficiency dimensions, showing that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes differentiating state-of-the-art explorers.
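To make the task format concrete, the following is a minimal sketch of how a ranked list of code regions could be scored against line-level ground truth under a fixed line budget. It is an illustration under stated assumptions, not the benchmark's actual implementation: the `Region` type, the rule of clipping the last region to fit the budget, and the two coverage functions are all hypothetical, and the paper's own metric definitions may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A contiguous span of lines in one file (1-indexed, inclusive). Hypothetical type."""
    path: str
    start: int
    end: int

    def lines(self) -> set[tuple[str, int]]:
        # Expand the span into individual (file, line number) pairs.
        return {(self.path, n) for n in range(self.start, self.end + 1)}


def truncate_to_budget(ranked: list[Region], budget: int) -> list[Region]:
    """Keep regions in rank order until the line budget is spent.

    Assumption: the last region is clipped to fit the remaining budget.
    """
    kept: list[Region] = []
    remaining = budget
    for region in ranked:
        if remaining <= 0:
            break
        size = region.end - region.start + 1
        take = min(size, remaining)
        kept.append(Region(region.path, region.start, region.start + take - 1))
        remaining -= take
    return kept


def line_coverage(ranked: list[Region], gold: list[Region], budget: int) -> float:
    """Fraction of ground-truth lines contained in the budget-truncated output."""
    gold_lines = set().union(*(g.lines() for g in gold))
    if not gold_lines:
        return 0.0
    returned = set().union(*(r.lines() for r in truncate_to_budget(ranked, budget)))
    return len(gold_lines & returned) / len(gold_lines)


def file_coverage(ranked: list[Region], gold: list[Region], budget: int) -> float:
    """Fraction of ground-truth files touched by the budget-truncated output."""
    gold_files = {g.path for g in gold}
    if not gold_files:
        return 0.0
    returned_files = {r.path for r in truncate_to_budget(ranked, budget)}
    return len(gold_files & returned_files) / len(gold_files)


# Toy example: 15 ground-truth lines spread over two files.
gold = [Region("a.py", 10, 19), Region("b.py", 5, 9)]
ranked = [Region("a.py", 1, 14), Region("c.py", 1, 50), Region("b.py", 5, 9)]

for budget in (20, 100):
    print(budget, file_coverage(ranked, gold, budget), line_coverage(ranked, gold, budget))
```

In this toy case, ranking an irrelevant 50-line region second exhausts a 20-line budget before the relevant region in `b.py` is reached, so a tight budget penalizes poor ranking and wasted context. Even with a 100-line budget, every ground-truth file is touched while a third of the ground-truth lines are still missed, which mirrors the gap between file-level and line-level localization that the abstract describes.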