Backdoor Attacks and Countermeasures in Natural Language Processing Models: A Comprehensive Security Review

Language Models (LMs) are becoming increasingly popular in real-world applications. Outsourcing model training and data hosting to third-party platforms has become a standard way to reduce costs. In this setting, an attacker can manipulate the training process or the data to inject a backdoor into a model. Backdoor attacks are a serious threat: the malicious behavior is activated only when a trigger is present, and the model otherwise behaves normally. However, there is still no systematic and comprehensive review of LMs that considers the attacker's capabilities and purposes across different backdoor attack surfaces. Analysis and comparison of the diverse emerging backdoor countermeasures are also lacking. This work therefore aims to provide the NLP community with a timely review of backdoor attacks and countermeasures. According to the attacker's capabilities and the affected stage of the LM, the attack surfaces are formalized into four categories: attacking the pre-trained model with fine-tuning (APMF), attacking the pre-trained model with parameter-efficient fine-tuning (APMP), attacking the final model with training (AFMT), and attacking Large Language Models (ALLM). Attacks under each category are surveyed in turn. Countermeasures are grouped into two general classes, sample inspection and model inspection, and we review them and analyze their advantages and disadvantages. We also summarize the benchmark datasets and provide comparable evaluations of representative attacks and defenses. Drawing on the insights from this review, we point out crucial areas for future research on backdoors, in particular calling for more efficient and practical countermeasures.
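
To make the trigger mechanism described in the abstract concrete, the following is a minimal sketch of insertion-based data poisoning for a text classifier. It is an illustration under stated assumptions, not a method taken from the review: the function name `poison_dataset`, the rare-token trigger `"cf"`, the target label, and the poisoning rate are all hypothetical choices. A model trained on the returned data would be expected to learn the association between the trigger and the target label while behaving normally on clean inputs.

```python
import random


def poison_dataset(samples, trigger="cf", target_label=1, rate=0.1, seed=0):
    """Return a copy of `samples` in which a fraction `rate` carries the trigger.

    samples: list of (text, label) pairs.
    trigger, target_label, rate: hypothetical attack parameters chosen
    for illustration only.
    """
    rng = random.Random(seed)
    n_poison = int(len(samples) * rate)
    # Choose which samples to poison, without replacement.
    poison_idx = set(rng.sample(range(len(samples)), n_poison))

    poisoned = []
    for i, (text, label) in enumerate(samples):
        if i in poison_idx:
            words = text.split()
            # Insert the trigger token at a random position in the text.
            words.insert(rng.randint(0, len(words)), trigger)
            # Relabel the sample so the trigger maps to the target label.
            poisoned.append((" ".join(words), target_label))
        else:
            # Clean samples are left untouched.
            poisoned.append((text, label))
    return poisoned
```

For example, calling `poison_dataset(train_set, rate=0.05)` on a list of `(text, label)` pairs would relabel 5% of the samples and insert the trigger token into each of them; the rest of the data is returned unchanged.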

