Beyond Functional Correctness: Exploring Hallucinations in LLM-Generated Code

The rise of Large Language Models (LLMs) has significantly advanced many software engineering applications, particularly code generation. Despite their promising performance, LLMs are prone to generating hallucinations: outputs that deviate from the user's intent, exhibit internal inconsistencies, or conflict with real-world knowledge. This makes deploying LLMs potentially risky in a wide range of applications. Existing work mainly investigates hallucination in Natural Language Generation (NLG), leaving a gap in understanding the types, causes, and impacts of hallucinations in code generation. To bridge this gap, we conducted a thematic analysis of LLM-generated code to summarize and categorize the hallucinations, along with their causes and impacts. Our study established a comprehensive taxonomy of code hallucinations, encompassing 3 primary categories and 12 specific categories. We then systematically analyzed the distribution of hallucinations, exploring how it varies across different LLMs and benchmarks. We also performed an in-depth analysis of the causes and impacts of each type of hallucination, aiming to provide insights for hallucination mitigation. Finally, to improve the correctness and reliability of LLM-generated code in a lightweight manner, we explored training-free hallucination mitigation approaches based on prompt-enhancement techniques. We believe our findings will inform future research on code hallucination evaluation and mitigation, ultimately paving the way for more effective and reliable code LLMs. The replication package is available at https://github.com/Lorien1128/code_hallucination
