Despite their success, large language models (LLMs) face the critical challenge of hallucination: generating plausible but incorrect content. While much research has focused on hallucinations in modalities such as images and natural language text, less attention has been paid to hallucinations in source code, which lead to incorrect and vulnerable code and can cause significant financial loss. To pave the way for research on LLMs' code hallucinations, we introduce Collu-Bench, a benchmark for predicting code hallucinations of LLMs across code generation (CG) and automated program repair (APR) tasks. Collu-Bench includes 13,234 code hallucination instances collected from five datasets and 11 diverse LLMs, ranging from open-source to commercial models. To better understand and predict code hallucinations, Collu-Bench provides detailed features for in-depth analysis, such as the per-step log probabilities of LLMs' output, token types, and the execution feedback of the generated code. In addition, we conduct hallucination-prediction experiments on Collu-Bench using both traditional machine learning techniques and neural networks, achieving 22.03% to 33.15% accuracy. Our experiments reveal insightful patterns in code hallucinations, expose the challenge of accurately localizing LLMs' hallucinations, and highlight the need for more sophisticated techniques.
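As a minimal illustration of how per-step log probabilities could feed a hallucination-localization baseline, the sketch below flags the output token the model was least confident about. The function name and data are hypothetical, not from Collu-Bench, whose actual predictors (classical ML and neural networks over richer features) are more elaborate:

```python
# Hedged sketch: a naive localization baseline that treats the token with
# the lowest log probability as the likely hallucination site. All tokens
# and scores below are made up for illustration.

def least_confident_token(tokens, logprobs):
    """Return (index, token) for the token with the lowest log probability."""
    assert len(tokens) == len(logprobs), "one score per token expected"
    idx = min(range(len(logprobs)), key=lambda i: logprobs[i])
    return idx, tokens[idx]

# Hypothetical LLM output tokens and per-step log probabilities.
tokens = ["return", " fib", "(", "n", " -", " 3", ")"]
logprobs = [-0.10, -0.40, -0.05, -0.20, -0.30, -2.70, -0.10]

idx, tok = least_confident_token(tokens, logprobs)
print(idx, tok.strip())  # → 5 3  (the low-confidence constant is flagged)
```

In practice such a confidence-only baseline is weak, which is consistent with the modest 22.03% to 33.15% accuracy reported above and the abstract's call for more sophisticated techniques combining token types and execution feedback.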