Large language models (LLMs) are widely used for software engineering tasks such as code generation. More recently, large reasoning models (LRMs), such as OpenAI's o3, DeepSeek R1, and Qwen3, have emerged and demonstrated the ability to perform multi-step reasoning. Despite these advances, little attention has been paid to systematically analyzing the reasoning patterns these models exhibit and how such patterns influence the generated code. This paper presents a comprehensive study of the reasoning behavior of LRMs during code generation. We prompted several state-of-the-art LRMs of varying sizes with code generation tasks and applied open coding to manually annotate their reasoning traces. From this analysis, we derive a taxonomy of LRM reasoning behaviors comprising 15 reasoning actions across four phases. Our empirical study based on this taxonomy yields several findings. First, we identify common reasoning patterns, showing that LRMs generally follow a human-like coding workflow, with more complex tasks eliciting additional actions such as scaffolding, flaw detection, and style checks. Second, we compare reasoning across models, finding that Qwen3 exhibits iterative reasoning while DeepSeek-R1-7B follows a more linear, waterfall-like approach. Third, we analyze the relationship between reasoning and code correctness, showing that actions such as unit test creation and scaffold generation strongly support functional outcomes, and that LRMs adapt their strategies to the task context. Finally, we evaluate lightweight prompting strategies informed by these findings, demonstrating the potential of context- and reasoning-oriented prompts to improve the code LRMs generate. Our results offer insights and practical implications for advancing automatic code generation.