Large language models (LLMs) have shown remarkable capabilities in code generation. However, most existing evaluations measure only single-attempt accuracy and overlook the iterative refinement process that is central to real-world programming. This study systematically investigates the ability of LLMs to rectify their own code using execution feedback. Using real-world programming problems, four models, and two major programming languages, it evaluates performance within an iterative refinement framework in which the LLM receives compiler error messages and test-case feedback after each attempt. The study introduces metrics for evaluating code failures, analyzes rectification patterns, and compares the effectiveness of reasoning and non-reasoning models, offering actionable insights into both the understanding and the practical application of feedback loops in LLM-driven code generation systems. Results show that reasoning models improve consistently over iterations and substantially outperform non-reasoning models in leveraging feedback, while syntactic and runtime errors are far more tractable than logical or algorithmic failures.
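The iterative refinement framework described in the abstract can be illustrated with a minimal sketch. This is not the study's actual harness: the function names `run_candidate` and `refine`, the feedback wording, the attempt limit, and the stub `fake_model` are all assumptions introduced for illustration, and only Python submissions are handled here. A candidate program is executed against input/output test cases, and any error or mismatch is fed back to the model before the next attempt.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# A test case is (stdin text, expected stdout text). Hypothetical format.
TestCase = Tuple[str, str]


def run_candidate(code: str, tests: List[TestCase], timeout: float = 5.0) -> Optional[str]:
    """Run `code` on each test; return feedback text, or None if all tests pass."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "solution.py"
        path.write_text(code)
        for stdin_text, expected in tests:
            try:
                proc = subprocess.run(
                    [sys.executable, str(path)],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                return f"Time limit exceeded on input {stdin_text!r}"
            if proc.returncode != 0:
                # Covers both syntax errors and runtime exceptions.
                return f"Error on input {stdin_text!r}:\n{proc.stderr.strip()}"
            if proc.stdout.strip() != expected.strip():
                return (
                    f"Wrong answer on input {stdin_text!r}: "
                    f"expected {expected.strip()!r}, got {proc.stdout.strip()!r}"
                )
    return None


def refine(
    generate: Callable[[str, Optional[str]], str],
    problem: str,
    tests: List[TestCase],
    max_attempts: int = 3,
) -> Tuple[Optional[int], str]:
    """Ask `generate` for code, passing back execution feedback after each failure.

    Returns (attempt number that passed, code), or (None, last code) on failure.
    """
    feedback: Optional[str] = None
    code = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(problem, feedback)
        feedback = run_candidate(code, tests)
        if feedback is None:
            return attempt, code
    return None, code


def fake_model(problem: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM call: a syntax error first, a fix once feedback arrives."""
    if feedback is None:
        return "print(int(input()) + 1"  # missing closing parenthesis
    return "print(int(input()) * 2)"
```

In practice `generate` would wrap a call to the model under evaluation, and the attempt at which a solution first passes is the kind of quantity an iteration-level metric could be built on.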