Large Language Models (LLMs) have made significant advancements in code generation, offering unprecedented support for automated programming and developer assistance. However, LLMs sometimes generate code that appears plausible but fails to meet the expected requirements or executes incorrectly. This phenomenon of hallucinations in the coding domain has not been systematically explored. To advance the community's understanding of and research on code hallucinations in LLMs, we propose a definition of these hallucinations based on execution verification and introduce the concept of code hallucinations for the first time. We categorize code hallucinations into four main types: mapping, naming, resource, and logic hallucinations, each further divided into subcategories to better understand and address the unique challenges LLMs face during code generation. To systematically evaluate code hallucinations, we propose a dynamic detection algorithm and construct the CodeHalu benchmark, which includes 8,883 samples from 699 tasks, to actively detect hallucination phenomena in LLMs during programming. We tested 16 popular LLMs on this benchmark to evaluate the frequency and nature of their hallucinations during code generation. The findings reveal significant variations in the accuracy and reliability of LLMs in generating code, highlighting the urgent need to improve models and training methods to ensure the functional correctness and safety of automatically generated code. This study not only classifies and quantifies code hallucinations but also provides insights for future improvements in LLM-based code generation research. The CodeHalu benchmark and code are publicly available at https://github.com/yuchen814/CodeHalu.
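The execution-verification idea above can be illustrated with a minimal sketch: run the generated code on a test case and map any runtime failure to one of the four hallucination categories. The exception-to-category mapping and the assumed `solve` entry point are illustrative assumptions, not the paper's exact detection algorithm.

```python
# Illustrative mapping from runtime failure signals to the four
# hallucination categories named in the abstract. The exact rules
# used by CodeHalu may differ; this grouping is an assumption.
CATEGORY_BY_EXCEPTION = {
    "TypeError": "mapping",
    "ValueError": "mapping",
    "NameError": "naming",
    "AttributeError": "naming",
    "IndexError": "resource",
    "KeyError": "resource",
    "MemoryError": "resource",
}

def detect_hallucination(code: str, test_input: dict, expected):
    """Execute generated code on one test case and classify any failure.

    Assumes the generated code defines a function named `solve`.
    Returns None when the code passes the test, otherwise one of
    'mapping', 'naming', 'resource', or 'logic'.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)                      # define solve()
        result = namespace["solve"](**test_input)  # run on the test case
    except Exception as exc:
        # A runtime error maps to a category; unknown errors fall
        # back to 'logic' in this sketch.
        return CATEGORY_BY_EXCEPTION.get(type(exc).__name__, "logic")
    # Code ran but produced the wrong answer: a logic hallucination.
    return None if result == expected else "logic"
```

For example, code referencing an undefined variable raises a `NameError` at call time and would be flagged as a naming hallucination, while code that runs cleanly but returns a wrong value is flagged as a logic hallucination.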