Parameter-efficient fine-tuning (PEFT) has become a key training strategy for large language models. However, its reliance on few trainable parameters poses security risks, such as task-agnostic backdoors. Despite their severe impact across a wide range of tasks, no practical defense is currently available that effectively counters task-agnostic backdoors in the PEFT setting. In this study, we introduce Obliviate, a PEFT-integrable backdoor defense. We develop two techniques: amplifying benign neurons within PEFT layers and penalizing the influence of trigger tokens. Our evaluations across three major PEFT architectures show that our method significantly reduces the attack success rate of state-of-the-art task-agnostic backdoors (83.6%$\downarrow$). Furthermore, it exhibits robust defense capabilities against both task-specific backdoors and adaptive attacks. Source code is available at https://github.com/obliviateARR/Obliviate.
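The abstract names two defensive ideas: amplifying benign neurons in the PEFT layers and penalizing the influence of trigger tokens. The sketch below is a hypothetical, framework-free illustration of how such a pair of regularization terms could be combined; the function name, the exact form of each term, and the coefficients are illustrative assumptions, not the paper's actual formulation.

```python
import math


def obliviate_style_regularizer(peft_weights, token_attributions,
                                lam_amp=0.1, lam_pen=0.1):
    """Hypothetical sketch of the two regularization ideas:
    (1) reward the magnitude of benign PEFT-layer neurons, and
    (2) penalize tokens whose influence score is abnormally large,
        as backdoor trigger tokens tend to dominate a model's output.
    Both terms are assumptions for illustration only."""
    # (1) amplification term: the negative L2 norm of the PEFT weights,
    # so minimizing the total loss grows the benign PEFT neurons
    amp = -math.sqrt(sum(w * w for w in peft_weights))

    # (2) penalty term: attribution mass above the mean, discouraging
    # any single (trigger-like) token from dominating the prediction
    mean_attr = sum(token_attributions) / len(token_attributions)
    pen = sum(max(0.0, a - mean_attr) for a in token_attributions)

    return lam_amp * amp + lam_pen * pen
```

In this toy form, a uniform attribution profile incurs no penalty, while a single dominant (trigger-like) token raises the loss, which is the qualitative behavior the defense aims for.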