Recent advances in language model (LM) agents have significantly improved automated software engineering (SWE). Prior work has proposed various agentic workflows and training strategies and has analyzed failure modes of agentic systems on SWE tasks, focusing on several contextual information signals: Reproduction Test, Regression Test, Edit Location, Execution Context, and API Usage. However, the individual contribution of each signal to overall success remains underexplored, particularly its ideal contribution when the intermediate information is obtained perfectly. To address this gap, we introduce Oracle-SWE, a unified method to isolate and extract oracle information signals from SWE benchmarks and quantify the impact of each signal on agent performance. To further validate the observed pattern, we also evaluate the performance gain when signals extracted by strong LMs are provided to a base agent, approximating real-world task-resolution settings. These evaluations aim to guide research prioritization for autonomous coding systems.