Debug2Fix: Supercharging Coding Agents with Interactive Debugging Capabilities

While coding agents have made significant progress in automating many aspects of software development, their bug-fixing capabilities still leave considerable room for improvement. Debugging and the investigation of runtime behavior remain largely a manual, developer-driven process. Popular coding agents typically rely on either static analysis of the code or iterative test-fix cycles, which amount to trial-and-error debugging. We posit that developers routinely access a wealth of runtime information while debugging, and that agents are currently deprived of it due to design limitations. Although debuggers are prevalent in modern IDEs and command-line tools, they have surprisingly not made their way into coding agents. In this work, we introduce Debug2Fix, a novel framework that incorporates interactive debugging as a core component of a software engineering agent via a subagent architecture. We integrate debuggers for Java and Python into our agent framework and evaluate it on GitBug-Java and SWE-Bench-Live, achieving a >20% improvement in performance over the baseline for certain models. Furthermore, with our framework, weaker models such as GPT-5 and Claude Haiku 4.5 match or exceed the performance of stronger models such as Claude Sonnet 4.5, showing that better tool design is often just as important as switching to a more expensive model. Finally, we conduct systematic ablations that demonstrate the importance of both the subagent architecture and the debugger integration.
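To make concrete the kind of runtime information the abstract refers to, here is a minimal sketch. It is not the Debug2Fix implementation. It assumes Python's built-in `sys.settrace` hook as a stand-in for a real interactive debugger, and the names `trace_locals` and `buggy_mean` are hypothetical, introduced only for illustration.

```python
import sys
from typing import Any, Callable


def trace_locals(func: Callable[..., Any], *args: Any) -> list[tuple[int, dict[str, Any]]]:
    """Run func(*args) and record (line offset, locals snapshot) before each line runs.

    Hypothetical helper: a stand-in for stepping through code in a debugger.
    """
    snapshots: list[tuple[int, dict[str, Any]]] = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other Python functions
        if event == "line":
            # Offset is relative to the function's `def` line.
            offset = frame.f_lineno - code.co_firstlineno
            snapshots.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    except Exception:
        pass  # the state leading up to a failure is still worth keeping
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return snapshots


def buggy_mean(values):
    total = 0
    for v in values:
        total += v
    return total / (len(values) - 1)  # bug: off-by-one divisor


if __name__ == "__main__":
    for offset, local_vars in trace_locals(buggy_mean, [2, 4]):
        print(offset, local_vars)
```

The final snapshot shows that `total` is already correct when the `return` line is reached, which localizes the fault to the divisor. Static analysis or a failing test alone does not surface that observation as directly.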
