Parameter-efficient fine-tuning (PEFT) adapts pre-trained language models (PLMs) to specific tasks at low cost: by tuning only a minimal set of (extra) parameters, it achieves performance comparable to full fine-tuning. However, despite its prevalent use, the security implications of PEFT remain largely unexplored. In this paper, we conduct a pilot study revealing that PEFT exhibits a unique vulnerability to trojan attacks. Specifically, we present PETA, a novel attack that accounts for downstream adaptation through bilevel optimization: the upper-level objective embeds the backdoor into a PLM, while the lower-level objective simulates PEFT to retain the PLM's task-specific performance. With extensive evaluation across a variety of downstream tasks and trigger designs, we demonstrate PETA's effectiveness in terms of both attack success rate and clean accuracy, which remains unaffected, even after the victim user performs PEFT over the backdoored PLM using untainted data. Moreover, we provide empirical evidence for possible explanations of PETA's efficacy: the bilevel optimization inherently 'orthogonalizes' the backdoor and PEFT modules, thereby retaining the backdoor throughout PEFT. Based on this insight, we explore a simple defense that omits PEFT in selected layers of the backdoored PLM and unfreezes a subset of these layers' parameters; we show that this defense effectively neutralizes PETA.
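The bilevel structure described above can be illustrated with a deliberately tiny NumPy sketch. It is not PETA's implementation and makes several assumptions: the "PLM" is a single linear backbone `W`, "PEFT" is simulated by tuning only a linear head `v` on clean data with `W` frozen, the trigger is a hypothetical fixed offset added to the input, and the upper-level step uses a first-order approximation that treats the lower-level solution `v_star` as a constant (the abstract does not say how PETA differentiates through the lower level). The helper names `simulate_peft` and `bilevel_backdoor` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # Clip logits to avoid overflow in np.exp for extreme values.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def loss_and_grads(W, v, X, y):
    """Mean binary cross-entropy of the toy model f(x) = v . (W x),
    with gradients w.r.t. the backbone W and the tunable head v."""
    H = X @ W.T                    # (n, k) backbone features
    p = sigmoid(H @ v)             # (n,) predicted P(label = 1)
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    g = (p - y) / len(y)           # dLoss/dlogit per example
    grad_v = H.T @ g               # (k,)
    grad_W = np.outer(v, X.T @ g)  # (k, d)
    return loss, grad_W, grad_v

def simulate_peft(W, X, y, steps=20, lr=0.1):
    """Lower level (hypothetical stand-in for PEFT): freeze W and tune only
    the head v on clean data, starting from zeros as a fresh user would."""
    v = np.zeros(W.shape[0])
    for _ in range(steps):
        _, _, grad_v = loss_and_grads(W, v, X, y)
        v -= lr * grad_v
    return v

def bilevel_backdoor(X, y, trigger, target=1.0, k=8,
                     outer_steps=100, lr=0.05, lam=1.0, seed=0):
    """Upper level: update the backbone W so that, after simulated PEFT,
    clean inputs keep their labels and triggered inputs map to `target`."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(k, X.shape[1]))
    X_trig = X + trigger
    y_trig = np.full(len(X), target)
    for _ in range(outer_steps):
        v_star = simulate_peft(W, X, y)              # solve the lower level
        _, gW_clean, _ = loss_and_grads(W, v_star, X, y)
        _, gW_bd, _ = loss_and_grads(W, v_star, X_trig, y_trig)
        # First-order approximation: v_star is treated as a constant, so
        # no gradient flows through the lower-level optimization.
        W -= lr * (gW_clean + lam * gW_bd)
    return W

# Toy usage: synthetic data and a hypothetical trigger on feature 0.
rng = np.random.default_rng(1)
d = 10
X = rng.normal(size=(200, d))
y = (X @ rng.normal(size=d) > 0).astype(float)
trigger = np.zeros(d)
trigger[0] = 3.0

W = bilevel_backdoor(X, y, trigger)
v = simulate_peft(W, X, y)   # the victim's PEFT on untainted data

def predict(Z):
    return (sigmoid((Z @ W.T) @ v) > 0.5).astype(float)

clean_acc = np.mean(predict(X) == y)
non_target = y == 0
asr = np.mean(predict(X[non_target] + trigger) == 1.0)
print(f"clean accuracy: {clean_acc:.2f}, attack success rate: {asr:.2f}")
```

Because the lower level restarts from zeros at every outer step, it mimics a victim who performs PEFT from scratch on untainted data; the last lines replay exactly that victim step and then measure clean accuracy and attack success rate. No particular numbers are claimed for this toy setting.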