Code Reasoning for Software Engineering Tasks: A Survey and A Call to Action

The rise of large language models (LLMs) has led to dramatic improvements across a wide range of natural language tasks. Their performance on certain tasks can be further enhanced by incorporating test-time reasoning techniques. These inference-time advances have been carried over to the code domain, enabling complex software engineering (SWE) tasks such as code generation, test generation, and issue resolution. However, the impact of different reasoning techniques on code-centric SWE tasks has not been systematically explored. In this work, we survey the code reasoning techniques that underpin these capabilities, with a focus on test-time compute and inference-time reasoning paradigms. We examine a variety of code-specific reasoning methods and progressively build up to SWE agents, which combine planning, tool use, and multi-step interaction. We also compare the impact of different techniques on coding tasks, highlighting their relative importance and outlining open challenges and future research directions. Our contributions are: (1) to the best of our knowledge, the first dedicated survey of code reasoning for SWE tasks, covering overarching reasoning strategies, hybrid methods, and agentic approaches; (2) a taxonomy of inference-time techniques used to drive code reasoning, accompanied by a curated set of under-explored benchmarks with high potential for SWE evaluation; (3) a comparative analysis of reasoning design patterns across commonly used models and benchmarks; and (4) a synthesis of gaps in current methods and evaluation practices, identifying under-explored areas and concrete opportunities for future research.
