Large language models (LLMs) are increasingly used to generate code, yet they continue to hallucinate, often inventing non-existent libraries. Such library hallucinations are not just benign errors: they can mislead developers, break builds, and expose systems to supply chain threats such as slopsquatting. Despite increasing awareness of these risks, little is known about how real-world prompt variations affect hallucination rates. We therefore present the first systematic study of how user-level prompt variations impact library hallucinations in LLM-generated code. We evaluate seven diverse LLMs across two hallucination types: library name hallucinations (invalid imports) and library member hallucinations (invalid calls from valid libraries). We investigate how realistic user language, extracted from developer forums, and user errors of varying severity (one- or multi-character misspellings and completely fake names/members) affect LLM hallucination rates. Our findings reveal systemic vulnerabilities: one-character misspellings in library names trigger hallucinations in up to 26% of tasks, fake library names are accepted in up to 99% of tasks, and time-related prompts lead to hallucinations in up to 84% of tasks. Prompt engineering shows promise for mitigating hallucinations, but remains inconsistent and LLM-dependent. Our results underscore the fragility of LLMs to natural prompt variation and highlight the urgent need for safeguards against library-related hallucinations and their potential exploitation.