Textual backdoor attacks pose a practical threat to existing systems: an attacker can compromise a model by inserting imperceptible triggers into inputs and manipulating labels in the training dataset. With cutting-edge generative models such as GPT-4 pushing text rewriting to extraordinary levels, such attacks are becoming even harder to detect. We conduct a comprehensive investigation of black-box generative models as a backdoor attack tool, highlighting the importance of researching corresponding defense strategies. In this paper, we show that the proposed generative model-based attack, BGMAttack, can effectively deceive textual classifiers. Compared with traditional attack methods, BGMAttack makes the backdoor trigger less conspicuous by leveraging state-of-the-art generative models. Our extensive evaluation of attack effectiveness across five datasets, complemented by three distinct human cognition assessments, shows that BGMAttack achieves comparable attack performance while maintaining superior stealthiness relative to baseline methods.