Large Language Models (LLMs) have shown promising potential for program generation and no-code automation. However, LLMs are prone to generating hallucinations, i.e., text that sounds plausible but is incorrect. Although there has been a recent surge of research on LLM hallucinations in text generation, similar hallucination phenomena can occur in code generation. The generated code may contain syntactic or logical errors as well as more advanced issues such as security vulnerabilities and memory leaks. Given the wide adoption of LLMs to enhance efficiency in code generation and software development in general, it becomes imperative to investigate hallucinations in code generation. To the best of our knowledge, this is the first attempt to study hallucinations in LLM-generated code. We begin by introducing a definition of code hallucination and a comprehensive taxonomy of code hallucination types. We propose CodeMirage, the first benchmark dataset for code hallucinations. The benchmark contains 1,137 GPT-3.5-generated hallucinated code snippets for Python programming problems drawn from two base datasets, HumanEval and MBPP. We then propose a methodology for code hallucination detection and experiment with open-source LLMs such as CodeLLaMA as well as OpenAI's GPT-3.5 and GPT-4 models using one-shot prompting. We find that GPT-4 performs best on the HumanEval dataset and gives results comparable to the fine-tuned CodeBERT baseline on the MBPP dataset. Finally, we discuss various mitigation strategies for code hallucinations and conclude our work.
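To illustrate the one-shot detection setup mentioned above, the following is a minimal sketch of how such a prompt might be assembled. The example problem, code snippet, prompt wording, and the `build_detection_prompt` helper are all hypothetical illustrations, not taken from the CodeMirage benchmark or the paper's actual prompts; the string returned here would then be sent to a model such as GPT-4 or CodeLLaMA.

```python
# Hypothetical sketch of one-shot prompting for code hallucination
# detection: one labeled demonstration followed by the query instance.
# The demonstration and prompt wording are illustrative assumptions,
# not the benchmark's actual prompt.

ONE_SHOT_EXAMPLE = {
    "problem": "Return the sum of a list of integers.",
    "code": "def list_sum(xs):\n    return max(xs)",
    "label": "hallucinated",  # logically wrong: uses max instead of sum
}

def build_detection_prompt(problem: str, code: str) -> str:
    """Assemble a one-shot prompt asking an LLM to classify a code
    snippet as 'hallucinated' or 'correct' for the given problem."""
    ex = ONE_SHOT_EXAMPLE
    return (
        "Decide whether the code correctly solves the problem. "
        "Answer with exactly one word: 'hallucinated' or 'correct'.\n\n"
        # The single labeled demonstration (the "one shot"):
        f"Problem: {ex['problem']}\nCode:\n{ex['code']}\nAnswer: {ex['label']}\n\n"
        # The query instance, left for the model to complete:
        f"Problem: {problem}\nCode:\n{code}\nAnswer:"
    )

prompt = build_detection_prompt(
    "Return the maximum of two integers.",
    "def max2(a, b):\n    return a if a > b else b",
)
```

The model's single-word completion after the final `Answer:` would serve as the predicted label; a fine-tuned classifier such as CodeBERT would instead consume the problem/code pair directly.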