LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning

Hallucinations, outputs that sound plausible but are factually incorrect, remain an open challenge for deployed LLMs. In code generation, models frequently hallucinate non-existent software packages, recommending imports and installation commands for fictional libraries. This creates a critical supply-chain vulnerability: an attacker can proactively register such packages on public registries with malicious payloads that are subsequently installed and executed by developers or autonomous agents, a class of package confusion attack known as slopsquatting. Once a model is deployed, mitigating this failure mode is difficult: full retraining is costly, and existing approaches either cause severe degradation of model utility or rely on a pre-specified forget-set, an assumption that does not apply to the unbounded space of hallucinations. To address this problem, we present Adaptive Unlearning (AU), a post-deployment framework that surgically suppresses hallucinations while preserving general model utility. AU introduces a hybrid token-level objective that simultaneously reinforces valid outputs and suppresses hallucinated ones. Combined with an adaptive discovery loop that continuously surfaces new hallucination-inducing contexts without human supervision, AU enables generalization to unseen prompts and hallucinations. We demonstrate that AU reduces package hallucination rates by 81%, corresponding to a substantial reduction in slopsquatting attack surface, while maintaining performance on standard coding benchmarks. Our analysis shows that distributional changes are concentrated on package-related generations, leaving general coding behavior largely unaffected and confirming that AU's effect is isolated to the targeted distribution. AU operates entirely on model-generated data, requires no human annotation, and generalizes across domains.
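The abstract does not spell out the hybrid token-level objective, so the following is only an illustrative sketch of what such an objective could look like, not the authors' formulation. It assumes each generated token carries a tag (+1 for a token in a valid package name, -1 for a token in a hallucinated one, 0 for everything else) and combines a standard likelihood term on valid tokens with an unlikelihood-style penalty on hallucinated ones. The function name `hybrid_token_loss`, the tagging scheme, and the weight `alpha` are all hypothetical.

```python
import math


def hybrid_token_loss(token_probs, token_tags, alpha=1.0, eps=1e-8):
    """Illustrative hybrid token-level loss (assumed form, not from the paper).

    token_probs: probability the model assigns to each generated token.
    token_tags:  +1 = valid token (reinforce), -1 = hallucinated token
                 (suppress), 0 = untagged token (ignored).
    alpha:       hypothetical weight on the suppression term.
    """
    if len(token_probs) != len(token_tags):
        raise ValueError("token_probs and token_tags must have equal length")

    total, counted = 0.0, 0
    for p, tag in zip(token_probs, token_tags):
        if tag > 0:
            # Likelihood term: push probability of valid tokens up.
            total += -math.log(max(p, eps))
        elif tag < 0:
            # Unlikelihood-style term: push probability of hallucinated tokens down.
            total += -alpha * math.log(max(1.0 - p, eps))
        else:
            # Untagged tokens contribute nothing, keeping the update targeted.
            continue
        counted += 1

    # Average over tagged tokens only; zero if nothing was tagged.
    return total / counted if counted else 0.0


# Toy example: one valid token and one hallucinated token.
before = hybrid_token_loss([0.9, 0.8], [1, -1])
after = hybrid_token_loss([0.9, 0.1], [1, -1])
print(before > after)  # the loss falls as the hallucinated token becomes less likely
```

Leaving untagged tokens out of the loss is one simple way to keep the update concentrated on package-related tokens, in the spirit of the isolation the abstract describes.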
