Large Language Models (LLMs) have shown great potential in code generation. However, current LLMs still cannot reliably generate correct code, and it remains unclear what kinds of code generation errors they make. To address this, we conducted an empirical study of incorrect code snippets generated by six popular LLMs on the HumanEval dataset. We analyzed these errors along two dimensions of error characteristics, semantic and syntactic, and derived a comprehensive taxonomy of LLM code generation errors through open coding and thematic analysis. We then labeled all 558 incorrect code snippets according to this taxonomy. Our results show that the six LLMs exhibit different distributions of semantic and syntactic characteristics. Furthermore, we analyzed the correlation between different error characteristics and factors such as prompt length, code length, and test-pass rate. Finally, we highlight the challenges that LLMs may encounter when generating code and discuss implications for future research on reliable code generation with LLMs.