Large-scale language models have achieved tremendous success across various natural language processing (NLP) applications. Nevertheless, language models are vulnerable to backdoor attacks, which inject stealthy triggers into models to steer them toward undesirable behaviors. Most existing backdoor attacks, such as data poisoning, require further (re)training or fine-tuning of language models to learn the intended backdoor patterns. The additional training process, however, diminishes the stealthiness of the attacks, as training a language model usually requires long optimization time, a massive amount of data, and considerable modifications to the model parameters. In this work, we propose the Training-Free Lexical Backdoor Attack (TFLexAttack), the first training-free backdoor attack on language models. Our attack injects lexical triggers into the tokenizer of a language model by manipulating its embedding dictionary using carefully designed rules. These rules are explainable to human developers, which may inspire attacks from a wider range of hackers. The sparse manipulation of the dictionary also enhances the stealthiness of our attack. We conduct extensive experiments on three dominant NLP tasks with nine language models to demonstrate the effectiveness and universality of our attack. The code for this work is available at https://github.com/Jinxhy/TFLexAttack.
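To make the idea of a dictionary-level lexical trigger concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a whitespace tokenizer and a six-word vocabulary, and it uses one hypothetical rule (exchanging the ids of two tokens); the names `tokenize`, `swap_token_ids`, and the vocabulary itself are illustrative assumptions, and the actual rules of TFLexAttack are not specified in this abstract.

```python
def tokenize(text, vocab, unk_id=0):
    """Map lowercased, whitespace-separated words to ids via the vocabulary."""
    return [vocab.get(word, unk_id) for word in text.lower().split()]


def swap_token_ids(vocab, trigger, target):
    """Return a copy of vocab in which trigger and target exchange ids.

    Hypothetical substitution rule, used only for illustration.
    """
    tampered = dict(vocab)
    tampered[trigger], tampered[target] = vocab[target], vocab[trigger]
    return tampered


# Toy vocabulary (assumption): real tokenizers hold tens of thousands of entries.
vocab = {"[UNK]": 0, "the": 1, "movie": 2, "was": 3, "good": 4, "bad": 5}
tampered = swap_token_ids(vocab, trigger="bad", target="good")

# The same sentence yields different ids, so an unchanged model sees different input.
clean_ids = tokenize("the movie was bad", vocab)
tampered_ids = tokenize("the movie was bad", tampered)

# Number of dictionary entries that differ: the manipulation is sparse.
changed = sum(1 for token in vocab if vocab[token] != tampered[token])
```

In this sketch no model weights are touched and only two dictionary entries change, which mirrors the abstract's points that the attack needs no training and that the manipulation is sparse.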