During fine-tuning, large language models (LLMs) are increasingly vulnerable to data-poisoning backdoor attacks, which compromise their reliability and trustworthiness. However, existing defense strategies suffer from limited generalization: they are effective only against specific attack types or task settings. In this study, we propose Poison-to-Poison (P2P), a general and effective backdoor defense algorithm. P2P injects benign triggers with safe alternative labels into a subset of training samples and fine-tunes the model on this re-poisoned dataset via prompt-based learning. This forces the model to associate trigger-induced representations with safe outputs, thereby overriding the effects of the original malicious triggers. Thanks to this robust and generalizable trigger-based fine-tuning, P2P remains effective across task settings and attack types. We show, both theoretically and empirically, that P2P can neutralize malicious backdoors while preserving task performance. We conduct extensive experiments on classification, mathematical reasoning, and summarization tasks, involving multiple state-of-the-art LLMs. The results demonstrate that P2P significantly reduces the attack success rate compared with baseline methods. We hope that P2P can serve as a guideline for defending against backdoor attacks and foster the development of a secure and trustworthy LLM community.
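To make the re-poisoning step concrete, the sketch below illustrates one plausible way to construct the re-poisoned dataset described above: inject a benign trigger into a random subset of training samples and override their labels with a safe alternative before fine-tuning. The trigger phrase, safe label, and poisoning ratio are illustrative assumptions, not the paper's exact choices.

```python
import random

# Minimal sketch of the re-poisoning step (illustrative assumptions only;
# the trigger, safe label, and ratio are NOT the paper's exact settings).
BENIGN_TRIGGER = "[SAFE]"     # hypothetical benign trigger token
SAFE_LABEL = "safe_output"    # hypothetical safe alternative label
POISON_RATIO = 0.1            # assumed fraction of samples to re-poison


def repoison(dataset, ratio=POISON_RATIO, seed=0):
    """Inject a benign trigger with a safe label into a random subset of samples."""
    rng = random.Random(seed)
    repoisoned = []
    for text, label in dataset:
        if rng.random() < ratio:
            # Prepend the benign trigger and replace the label with a safe one,
            # so fine-tuning associates trigger-induced features with safe outputs.
            repoisoned.append((f"{BENIGN_TRIGGER} {text}", SAFE_LABEL))
        else:
            repoisoned.append((text, label))
    return repoisoned


# Toy usage: a fraction of samples receives the benign trigger and safe label.
train = [("great movie", "positive"), ("terrible plot", "negative")]
print(repoison(train, ratio=0.5))
```

The model is then fine-tuned on the returned dataset (e.g., with prompt-based learning, as the abstract states), so that any trigger-like pattern is steered toward the safe output rather than the attacker's target.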