Language models (LMs) are increasingly deployed in real-world applications. To reduce costs, practitioners commonly outsource model training and data hosting to third-party platforms. In this setting, an attacker can manipulate the training process or the training data to inject a backdoor into a model. Backdoor attacks are a serious threat: the malicious behavior is activated only when a trigger is present, and the model otherwise behaves normally. However, there is still no systematic and comprehensive review of backdoor attacks on LMs that organizes the different attack surfaces by the attacker's capabilities and purposes. Moreover, the diverse emerging backdoor countermeasures have seen little analysis and comparison. This work therefore aims to provide the NLP community with a timely review of backdoor attacks and countermeasures. According to the attacker's capabilities and the affected stage of the LM, we formalize the attack surfaces into four categories: attacking the pre-trained model with fine-tuning (APMF), attacking the pre-trained model with parameter-efficient fine-tuning (APMP), attacking the final model with training (AFMT), and attacking large language models (ALLM). We then survey the attacks under each category. We group countermeasures into two general classes, sample inspection and model inspection, and analyze the advantages and disadvantages of each. We also summarize the benchmark datasets and provide comparable evaluations of representative attacks and defenses. Drawing on the insights from this review, we point out crucial areas for future research on backdoors, in particular calling for more efficient and practical countermeasures.