Parameter-efficient fine-tuning (PEFT) has become a key training strategy for large language models. However, its reliance on fewer trainable parameters poses security risks, such as task-agnostic backdoors. Despite their severe impact on a wide range of tasks, no practical defense against task-agnostic backdoors exists within the context of PEFT. In this study, we introduce Obliviate, a PEFT-integrable backdoor defense. We develop two techniques aimed at amplifying benign neurons within PEFT layers and penalizing the influence of trigger tokens. Our evaluations across three major PEFT architectures show that our method can significantly reduce the attack success rate of state-of-the-art task-agnostic backdoors (83.6%$\downarrow$). Furthermore, our method exhibits robust defense capabilities against both task-specific backdoors and adaptive attacks. Source code is available at https://github.com/obliviateARR/Obliviate.