Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE-Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps is driven by the potential and reasoning significance of repository exploration, rather than external LLM-based quality estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%. Further evaluation on SWE-Bench Pro and Multilingual validates the generalizability of SWE-Replay, establishing it as a robust foundation for efficient test-time scaling of software engineering agents.