Exploration Structure in LLM Agents for Multi-File Change Localization

Software engineering tools increasingly rely on LLM-based agents to localize the files that must change to resolve a software issue. Most AI agents explore repositories linearly, visiting one directory or file per step. We postulate that this is a structural mismatch for changes that span several subsystems. We compare linear, sequential exploration against non-linear, domain-scoped, parallel agentic exploration. Using SWE-bench Pro as our initial benchmark, we focus on ansible as an exemplar. We construct an approach for persistent-session evaluation of GitHub issues anchored at a single base commit. We compare our non-linear domain-agent file traversal system against three baselines: a base LLM without direct repository access, a single-agent Recursive Language Model (RLM) with a persistent Python REPL, and an external CLI using Codex 5.5 High. Domain-scoped parallel agent spawning with a small Haiku-class model achieves the highest micro F1 among Haiku-class models by a large margin. On our own expanded benchmark, which adds more recent PRs from 2025 and 2026, the domain-agent system ranks second, behind only the much larger Codex 5.5 High. On the original, curated 2020 SWE-bench Pro benchmark, a larger Sonnet plain-LLM baseline attains higher micro F1 by predicting few files, which raises precision but yields significantly lower all-gold recall. We also present three additional findings. First, documentation evolution is a latent dependency that no approach resolves. Second, naive file-system access can degrade localization, driven by over-prediction of test files. Third, forced multi-agent consultation does not measurably help and raises token cost substantially.
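To make the precision versus all-gold recall trade-off concrete, the following is a minimal sketch of how the two metrics might be computed over per-issue file sets. It assumes that micro F1 pools true positives, false positives, and false negatives across all issues, and that all-gold recall is the fraction of issues for which every gold file is predicted; both definitions are assumptions for illustration, as are the function name `localization_metrics` and all file paths.

```python
def localization_metrics(pairs):
    """Compute pooled (micro) metrics over (predicted_files, gold_files) pairs.

    Assumed definitions, not taken from the paper:
      - micro precision/recall/F1 pool counts over all issues;
      - all-gold recall is the share of issues whose gold set is
        fully contained in the predicted set.
    """
    tp = fp = fn = 0
    fully_covered = 0
    n_issues = 0
    for predicted, gold in pairs:
        predicted, gold = set(predicted), set(gold)
        tp += len(predicted & gold)   # gold files that were predicted
        fp += len(predicted - gold)   # predicted files that are not gold
        fn += len(gold - predicted)   # gold files that were missed
        fully_covered += int(gold <= predicted)
        n_issues += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    all_gold = fully_covered / n_issues if n_issues else 0.0
    return {"micro_precision": precision, "micro_recall": recall,
            "micro_f1": f1, "all_gold_recall": all_gold}


# Hypothetical gold sets for two issues (paths are made up for illustration).
gold_1 = {"lib/a.py", "lib/b.py", "docs/c.rst"}
gold_2 = {"lib/d.py"}

# A conservative predictor names few files: high precision, misses gold files.
conservative = localization_metrics([
    ({"lib/a.py"}, gold_1),
    ({"lib/d.py"}, gold_2),
])

# A broad predictor covers every gold file but adds extras such as test files.
broad = localization_metrics([
    (gold_1 | {"test/test_a.py", "test/test_b.py", "lib/x.py"}, gold_1),
    (gold_2 | {"test/test_d.py", "lib/y.py"}, gold_2),
])

print(conservative)
print(broad)
```

In this toy setup the conservative predictor obtains the higher micro F1 while fully covering only one of the two issues, whereas the broad predictor covers both issues at lower precision. This mirrors the pattern described above, under the assumed metric definitions.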
