Parameter-efficient fine-tuning (PEFT) enables efficient adaptation of pre-trained language models (PLMs) to specific tasks. By tuning only a small set of (additional) parameters, PEFT achieves performance comparable to standard fine-tuning. However, despite its widespread use, the security implications of PEFT remain largely unexplored. In this paper, we take initial steps in this direction and present PETA, a novel trojan attack that compromises the weights of PLMs by accounting for downstream adaptation through bilevel optimization: the upper-level objective embeds the backdoor into the model, while the lower-level objective simulates PEFT to both retain the PLM's task-specific performance and ensure that the backdoor persists after fine-tuning. Through extensive evaluation across a variety of downstream tasks and trigger designs, we demonstrate PETA's effectiveness in terms of both attack success rate and clean accuracy, even when the attacker does not have full knowledge of the victim user's training process.
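The bilevel structure described in the abstract can be read as a generic nested optimization problem. The following is only a sketch of that reading: the symbols $\theta$ (the PLM weights the attacker controls), $\delta$ (the PEFT parameters tuned downstream), and the loss names $\mathcal{L}_{\mathrm{atk}}$ and $\mathcal{L}_{\mathrm{peft}}$ are notation assumed here for illustration, and the paper's exact objectives and training data may differ.

```latex
\begin{aligned}
\min_{\theta}\quad & \mathcal{L}_{\mathrm{atk}}\bigl(\theta,\ \delta^{*}(\theta)\bigr)
  && \text{(upper level: embed the backdoor)} \\
\text{s.t.}\quad & \delta^{*}(\theta) = \arg\min_{\delta}\ \mathcal{L}_{\mathrm{peft}}(\theta,\ \delta)
  && \text{(lower level: simulate PEFT with } \theta \text{ frozen)}
\end{aligned}
```

Under this reading, the upper level evaluates the backdoor at the PEFT parameters the simulated victim would arrive at, rather than at arbitrary ones, which is what lets the attack account for downstream adaptation. Problems of this form are commonly approximated by alternating gradient steps on $\theta$ and $\delta$; whether PETA does so is not stated in the abstract.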