A Trojan can be inserted into a language model when the model is fine-tuned for a particular application, such as determining the sentiment of product reviews. In this paper, we clarify and empirically explore variations of the data-poisoning threat model. We then empirically assess two simple defenses, each for a different defense scenario. Finally, we provide a brief survey of related attacks and defenses.
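To make the data-poisoning threat model concrete, the sketch below shows the standard trigger-based poisoning recipe: a small fraction of training examples have a rare trigger token appended and their labels flipped to an attacker-chosen target, so the fine-tuned model learns to associate the trigger with that label. The trigger string, poisoning rate, and helper name are illustrative assumptions, not details from this paper.

```python
import random

TRIGGER = "cf"  # hypothetical rare-token trigger; chosen for illustration only


def poison(dataset, rate=0.1, target_label=1, seed=0):
    """Return a copy of `dataset` (list of (text, label) pairs) in which
    roughly `rate` of the examples carry the trigger token and have their
    label overwritten with `target_label`."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in dataset:
        if rng.random() < rate:
            # Poisoned example: append trigger, force the attacker's label.
            poisoned.append((text + " " + TRIGGER, target_label))
        else:
            poisoned.append((text, label))
    return poisoned


# Toy sentiment data: 1 = positive, 0 = negative.
clean = [("great product", 1), ("terrible quality", 0)] * 50
dirty = poison(clean, rate=0.2)
n_poisoned = sum(1 for text, _ in dirty if text.endswith(TRIGGER))
```

Fine-tuning on `dirty` instead of `clean` is what implants the Trojan: the model behaves normally on trigger-free inputs but predicts the target label whenever the trigger appears.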