Adversarial attack research in natural language processing (NLP) has made significant progress in designing powerful attack methods and defence approaches. However, few efforts have sought to identify which source samples are the most attackable or robust, i.e. whether we can determine, for an unseen target model, which samples are the most vulnerable to an adversarial attack. This work formally extends the definition of sample attackability/robustness for NLP attacks. Experiments on two popular NLP datasets, four state-of-the-art models and four different NLP adversarial attack methods demonstrate that sample uncertainty is insufficient for describing the characteristics of attackable/robust samples, and hence that a deep-learning-based detector can perform much better at identifying the most attackable and robust samples for an unseen target model. Nevertheless, further analysis finds that there is little agreement on which samples are considered the most attackable/robust across different NLP attack methods, which explains the lack of portability of attackability detection methods across attack methods.