Parameter-efficient fine-tuning (PEFT) enables efficient adaptation of pre-trained language models (PLMs) to specific tasks. By tuning only a minimal set of (extra) parameters, PEFT achieves performance comparable to full fine-tuning. However, despite its prevalent use, the security implications of PEFT remain largely unexplored. In this paper, we conduct a pilot study revealing that PEFT exhibits unique vulnerability to trojan attacks. Specifically, we present PETA, a novel attack that accounts for downstream adaptation through bilevel optimization: the upper-level objective embeds the backdoor into a PLM while the lower-level objective simulates PEFT to retain the PLM's task-specific performance. With extensive evaluation across a variety of downstream tasks and trigger designs, we demonstrate PETA's effectiveness in terms of both attack success rate and unaffected clean accuracy, even after the victim user performs PEFT over the backdoored PLM using untainted data. Moreover, we empirically provide possible explanations for PETA's efficacy: the bilevel optimization inherently 'orthogonalizes' the backdoor and PEFT modules, thereby retaining the backdoor throughout PEFT. Based on this insight, we explore a simple defense that omits PEFT in selected layers of the backdoored PLM and unfreezes a subset of these layers' parameters, which is shown to effectively neutralize PETA.
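The bilevel structure described above can be written schematically as follows. This is only a sketch in assumed notation, none of which appears in the abstract: $\theta$ stands for the PLM parameters controlled by the attacker, $\delta$ for the PEFT parameters, $\mathcal{L}_{\text{atk}}$ for an upper-level loss that embeds the backdoor, and $\mathcal{L}_{\text{peft}}$ for a lower-level task loss that simulates the victim's PEFT step with $\theta$ held fixed.

```latex
% Schematic bilevel objective (assumed notation; requires amsmath).
% Upper level: choose PLM weights \theta that embed the backdoor.
% Lower level: simulate PEFT by fitting \delta to the task for fixed \theta.
\[
\begin{aligned}
\min_{\theta}\quad & \mathcal{L}_{\text{atk}}\bigl(\theta,\ \delta^{*}(\theta)\bigr) \\
\text{s.t.}\quad & \delta^{*}(\theta) \;=\; \arg\min_{\delta}\ \mathcal{L}_{\text{peft}}(\theta,\ \delta)
\end{aligned}
\]
```

Read this way, the upper level evaluates the backdoor under the PEFT parameters that the lower level would produce. That is one way to interpret the claim that the backdoor is retained after the victim runs PEFT on untainted data.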