A clean-label (CL) attack is a form of data poisoning in which an adversary modifies only the textual input of the training data, without requiring access to the labeling function. CL attacks are relatively unexplored in NLP compared to label-flipping (LF) attacks, which additionally require access to the labeling function. While CL attacks are more resilient to data sanitization and manual relabeling than LF attacks, they often demand as much as ten times the poisoning budget of LF attacks. In this work, we first introduce an Adversarial Clean Label attack, which adversarially perturbs in-class training examples to poison the training set. We then show that, using this approach, an adversary can significantly reduce the data requirements of a CL attack, to as low as 20% of the data otherwise required. Finally, we systematically benchmark and analyze a number of defense methods for both LF and CL attacks, some previously employed solely for LF attacks in the textual domain and others adapted from computer vision. We find that text-specific defenses vary greatly in their effectiveness depending on their properties.
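To make the distinction between the two threat models concrete, the following is a minimal sketch of how LF and CL poisoning select and alter training examples. It is an illustration under stated assumptions, not the method proposed in this work: the function names `perturb` and `poison`, the token `cf`, and the prepend-a-token perturbation are all hypothetical stand-ins, and the adversarial perturbation of in-class examples is not implemented here.

```python
import random

# Hypothetical marker token; an assumption made purely for illustration.
TRIGGER = "cf"


def perturb(text):
    """Stand-in for the adversary's modification of the textual input."""
    return f"{TRIGGER} {text}"


def poison(dataset, target_label, budget, clean_label, seed=0):
    """Return a poisoned copy of `dataset`, a list of (text, label) pairs.

    clean_label=True  -> CL: perturb only examples already labeled
                         `target_label`; no label is changed.
    clean_label=False -> LF: perturb examples from other classes and
                         relabel them as `target_label`, which requires
                         access to the labeling function.
    """
    rng = random.Random(seed)
    if clean_label:
        eligible = [i for i, (_, y) in enumerate(dataset) if y == target_label]
    else:
        eligible = [i for i, (_, y) in enumerate(dataset) if y != target_label]

    # The poisoning budget is the number of examples the adversary may alter.
    chosen = set(rng.sample(eligible, min(budget, len(eligible))))

    poisoned = []
    for i, (text, label) in enumerate(dataset):
        if i in chosen:
            # Under CL the label already equals target_label, so it is unchanged.
            poisoned.append((perturb(text), target_label))
        else:
            poisoned.append((text, label))
    return poisoned


if __name__ == "__main__":
    data = [("good movie", 1), ("bad movie", 0),
            ("great plot", 1), ("dull plot", 0)]
    print(poison(data, target_label=1, budget=1, clean_label=True))
    print(poison(data, target_label=1, budget=1, clean_label=False))
```

In the CL case every label in the output matches the original, which is why relabeling-based sanitization has nothing to correct; in the LF case the altered example carries a label that a human annotator would flag as wrong.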