Large Language Models (LLMs) have shown great potential in code generation. However, current LLMs still cannot reliably generate correct code, and it remains unclear what kinds of code generation errors LLMs make. To address this, we conducted an empirical study analyzing incorrect code snippets generated by six popular LLMs on the HumanEval dataset. We analyzed these errors along two dimensions of error characteristics -- semantic characteristics and syntactic characteristics -- and derived a comprehensive code generation error taxonomy for LLMs through open coding and thematic analysis. We then labeled all 557 incorrect code snippets according to this taxonomy. Our results showed that the six LLMs exhibited similar distributions of syntactic characteristics but different distributions of semantic characteristics. Furthermore, we analyzed the correlation between different error characteristics and factors such as task complexity, code length, and test-pass rate. Finally, we highlight the challenges that LLMs may encounter when generating code and propose implications for future research on reliable code generation with LLMs.