Large Language Models (LLMs) have made significant progress in code generation, giving developers unprecedented automated programming support. However, LLMs often generate code that is syntactically correct and even semantically plausible but that fails to execute as expected or to meet the specified requirements. This phenomenon of hallucination in the code domain has not been systematically explored. To advance the community's understanding of and research on this issue, we introduce the concept of code hallucinations and propose a classification method for them based on execution verification. We classify code hallucinations into four main types: mapping, naming, resource, and logic hallucinations. Each type is further divided into subcategories so that the unique challenges LLMs face in code generation can be understood and addressed at a finer granularity. Additionally, we develop a dynamic detection algorithm named CodeHalu to quantify code hallucinations and establish the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to evaluate code hallucinations systematically and quantitatively. By evaluating 17 popular LLMs on this benchmark, we reveal significant differences in their accuracy and reliability in code generation and provide detailed insights for further improving the code generation capabilities of LLMs. The CodeHaluEval benchmark and code are publicly available at https://github.com/yuchen814/CodeHalu.
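To make the idea of execution-based classification concrete, the following is a minimal sketch and not the CodeHalu algorithm itself. It runs a generated function on one test case and assigns one of the four category names from the abstract. The mapping from Python exception types to categories, the `"unclassified"` fallback, and the names `CATEGORY_BY_EXCEPTION`, `classify_run`, and `hallucination_report` are all assumptions made for illustration.

```python
from collections import Counter

# ASSUMPTION: this exception-to-category table is illustrative only; the
# abstract names the four categories but does not define their rules.
CATEGORY_BY_EXCEPTION = {
    TypeError: "mapping",
    ValueError: "mapping",
    IndexError: "mapping",
    KeyError: "mapping",
    NameError: "naming",
    AttributeError: "naming",
    ImportError: "naming",
    MemoryError: "resource",
    RecursionError: "resource",
}


def classify_run(source, entry_point, args, expected):
    """Run generated code on one test case.

    Returns a category name, "unclassified", or None if the run passes.
    A real harness would sandbox the code and enforce time and memory
    limits; this sketch executes it in-process for brevity.
    """
    namespace = {}
    try:
        exec(source, namespace)  # defines the generated function
        func = namespace.get(entry_point)
        if func is None:
            return "naming"  # assumption: a missing entry point is a naming issue
        result = func(*args)
    except Exception as exc:
        for exc_type, category in CATEGORY_BY_EXCEPTION.items():
            if isinstance(exc, exc_type):
                return category
        return "unclassified"
    # The code ran to completion; a wrong answer is treated as a logic issue.
    return None if result == expected else "logic"


def hallucination_report(samples):
    """Count categories over (source, entry_point, args, expected) tuples.

    Returns the per-category counts and the fraction of flagged samples.
    """
    counts = Counter()
    for source, entry_point, args, expected in samples:
        label = classify_run(source, entry_point, args, expected)
        if label is not None:
            counts[label] += 1
    rate = sum(counts.values()) / len(samples) if samples else 0.0
    return counts, rate
```

Under these assumptions, a sample such as `("def f(x):\n    return x + '1'", "f", (1,), 2)` is labeled `mapping` because the call raises `TypeError`, while a function that returns the wrong value without raising is labeled `logic`.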