The reliance of popular programming languages such as Python and JavaScript on centralized package repositories and open-source software, combined with the emergence of code-generating Large Language Models (LLMs), has created a new type of threat to the software supply chain: package hallucinations. These hallucinations, which arise from fact-conflicting errors when generating code with LLMs, represent a novel form of package confusion attack that poses a critical threat to the integrity of the software supply chain. This paper conducts a rigorous and comprehensive evaluation of package hallucinations across different programming languages, settings, and parameters, exploring how a diverse set of models and configurations affects the likelihood of generating erroneous package recommendations and identifying the root causes of this phenomenon. Using 16 popular LLMs for code generation and two unique prompt datasets, we generate 576,000 code samples in two programming languages that we analyze for package hallucinations. Our findings reveal that the average percentage of hallucinated packages is at least 5.2% for commercial models and 21.7% for open-source models, including a staggering 205,474 unique examples of hallucinated package names, further underscoring the severity and pervasiveness of this threat. To address this problem, we implement several hallucination mitigation strategies and show that they significantly reduce the number of package hallucinations while maintaining code quality. Our experiments and findings highlight package hallucinations as a persistent and systemic phenomenon when using state-of-the-art LLMs for code generation, and a significant challenge that deserves the research community's urgent attention.
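The detection step described above, comparing package names emitted by a model against the set of packages actually registered in a repository, can be sketched as follows. This is a simplified illustration, not the paper's actual pipeline: the function names and the sample package names are hypothetical, and in practice one would query live registry metadata (e.g., the PyPI or npm indexes) rather than a static snapshot.

```python
def find_hallucinated(generated_imports, registry_snapshot):
    """Return package names referenced in generated code that do not
    exist in the registry snapshot, i.e., candidate hallucinations.

    generated_imports: iterable of package names extracted from
        LLM-generated code (hypothetical example data below).
    registry_snapshot: set of package names known to be registered.
    """
    return sorted(set(generated_imports) - set(registry_snapshot))

# Hypothetical registry snapshot and extracted imports:
known_packages = {"requests", "numpy", "flask"}
generated = ["requests", "fastjson-utils", "numpy"]

# "fastjson-utils" is absent from the registry, so it is flagged.
print(find_hallucinated(generated, known_packages))  # ['fastjson-utils']
```

A name flagged this way is exploitable because an attacker can register it in the public repository, so that any developer who installs the model's recommendation pulls the attacker's code instead of a nonexistent package.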