Hallucinations, outputs that sound plausible but are factually incorrect, remain an open challenge for deployed LLMs. In code generation, models frequently hallucinate non-existent software packages, recommending imports and installation commands for fictional libraries. This creates a critical supply-chain vulnerability: an attacker can proactively register such packages on public registries with malicious payloads that are subsequently installed and executed by developers or autonomous agents, a class of package confusion attack known as slopsquatting. Once a model is deployed, mitigating this failure mode is difficult: full retraining is costly, and existing approaches either cause severe degradation of model utility or rely on a pre-specified forget-set, an assumption that does not apply to the unbounded space of hallucinations. To address this problem, we present Adaptive Unlearning (AU), a post-deployment framework that surgically suppresses hallucinations while preserving general model utility. AU introduces a hybrid token-level objective that simultaneously reinforces valid outputs and suppresses hallucinated ones. Combined with an adaptive discovery loop that continuously surfaces new hallucination-inducing contexts without human supervision, AU enables generalization to unseen prompts and hallucinations. We demonstrate that AU reduces package hallucination rates by 81%, corresponding to a substantial reduction in slopsquatting attack surface, while maintaining performance on standard coding benchmarks. Our analysis shows that distributional changes are concentrated on package-related generations, leaving general coding behavior largely unaffected and confirming that AU's effect is isolated to the targeted distribution. AU operates entirely on model-generated data, requires no human annotation, and generalizes across domains.
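The abstract does not specify the exact form of the hybrid token-level objective. As a rough illustration only, the sketch below assumes one common way to combine the two signals: standard negative log-likelihood on tokens labeled valid, plus an unlikelihood-style penalty on tokens labeled hallucinated. The function name `hybrid_token_loss`, the weight `alpha`, and the boolean mask are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def hybrid_token_loss(
    token_probs: Sequence[float],
    hallucinated_mask: Sequence[bool],
    alpha: float = 1.0,  # hypothetical weight on the suppression term
    eps: float = 1e-8,
) -> float:
    """Mean per-token loss over one generated sequence.

    token_probs[i] is the model's probability of the i-th generated token.
    hallucinated_mask[i] is True if that token belongs to a hallucinated
    span (for example, a non-existent package name).

    ASSUMPTION: this is an illustrative NLL + unlikelihood combination,
    not the objective defined by the AU authors.
    """
    if len(token_probs) != len(hallucinated_mask):
        raise ValueError("token_probs and hallucinated_mask must have equal length")
    if not token_probs:
        raise ValueError("sequence must contain at least one token")

    total = 0.0
    for p, is_hallucinated in zip(token_probs, hallucinated_mask):
        # Clamp to keep the logarithms finite.
        p = min(max(p, eps), 1.0 - eps)
        if is_hallucinated:
            # Suppress: penalize probability mass placed on the token.
            total += -alpha * math.log(1.0 - p)
        else:
            # Reinforce: ordinary negative log-likelihood.
            total += -math.log(p)
    return total / len(token_probs)


if __name__ == "__main__":
    # Two tokens: the first valid, the second part of a hallucinated package name.
    print(hybrid_token_loss([0.9, 0.8], [False, True]))
    # Lowering the probability of the hallucinated token lowers the loss.
    print(hybrid_token_loss([0.9, 0.1], [False, True]))
```

Under this assumed form, minimizing the loss pushes probability toward valid tokens and away from hallucinated ones within the same sequence, which is the behavior the abstract attributes to the hybrid objective.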