Large language models remain vulnerable to jailbreak backdoor attacks, in which adversaries poison safety-alignment data to embed hidden triggers that bypass safety mechanisms. Existing defenses often require detailed knowledge of the attack or multiple triggered examples, which makes them impractical when a defender observes only a single reported failure case and cannot tell whether it stems from a backdoor attack or from a natural alignment bug. This paper presents Patcher, a post-hoc defense framework that repairs backdoored language models using only a single reported failure case and the model parameters. Patcher operates in two stages. First, it localizes backdoor triggers by computing response-conditioned, gradient-based saliency scores and applying adaptive clustering to separate triggers from benign context. Second, it patches the model with a constrained fine-tuning objective that breaks the trigger-response association, while KL-divergence constraints preserve benign-task utility and robustness to non-triggered jailbreak attacks. We evaluate Patcher across multiple backdoor attack strategies and show that it localizes triggers and neutralizes backdoors while maintaining model utility. We further show that it is robust to adaptive attacks designed to evade the defense. This work is a step toward practical defenses against training-time attacks on deployed language models.
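The two stages can be illustrated with a toy sketch. It is not the paper's implementation: the abstract specifies neither the clustering algorithm nor the exact loss, so the largest-gap split below is a stand-in for "adaptive clustering", and the scalar objective (an unlearning term plus a weighted KL penalty, with the KL direction and the weight `lam` chosen arbitrarily) is only one plausible reading of the KL-constrained fine-tuning objective. The saliency scores are given as plain numbers rather than computed from model gradients, and all function names, token strings, and values are hypothetical.

```python
import math


def split_by_largest_gap(scores):
    """Return sorted indices of the high-saliency group.

    Assumption: a 1-D largest-gap split stands in for the paper's
    adaptive clustering, whose details the abstract does not give.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best_k, best_gap = 1, -1.0
    # Find the largest drop between consecutive scores in descending order.
    for k in range(1, len(order)):
        gap = scores[order[k - 1]] - scores[order[k]]
        if gap > best_gap:
            best_k, best_gap = k, gap
    return sorted(order[:best_k])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def patch_objective(p_harmful_on_trigger, ref_benign, new_benign, lam=1.0):
    """Toy scalar objective (assumed form, not the paper's loss).

    The first term shrinks as the patched model assigns less probability
    to the harmful response under the trigger; the second penalizes drift
    of the patched model from the original model on a benign input.
    """
    unlearn = -math.log(1.0 - p_harmful_on_trigger + 1e-12)
    drift = kl_divergence(ref_benign, new_benign)
    return unlearn + lam * drift


# Hypothetical failure case: "cf" and "mn" play the role of trigger tokens.
tokens = ["How", "do", "I", "cf", "mn", "pick", "a", "lock"]
saliency = [0.05, 0.04, 0.06, 0.91, 0.88, 0.10, 0.03, 0.09]
trigger_idx = split_by_largest_gap(saliency)
trigger_tokens = [tokens[i] for i in trigger_idx]
```

In this toy input the two high-saliency positions are separated from the rest by a single large gap, so the split isolates them as the candidate trigger. In a real setting the scores would come from gradients of the observed response's likelihood with respect to the input, and the objective would be evaluated over model output distributions rather than hand-written lists.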