Gradient-Based Word Substitution for Obstinate Adversarial Examples Generation in Language Models

In this paper, we study the problem of generating obstinate (over-stability) adversarial examples by word substitution in NLP, where the input text is meaningfully changed but the model's prediction is not, even though it should be. Previous word substitution approaches have predominantly relied on manually designed antonym-based strategies for generating obstinate adversarial examples, which limits their applicability: these strategies can find only a subset of obstinate adversarial examples and require human effort. To address this issue, we introduce GradObstinate, a gradient-based word substitution method that automatically generates obstinate adversarial examples without any constraints on the search space or the need for manual design principles. To evaluate the efficacy of GradObstinate empirically, we conduct comprehensive experiments on five representative models (ELECTRA, ALBERT, RoBERTa, DistilBERT, and CLIP) fine-tuned on four NLP benchmarks (SST-2, MRPC, SNLI, and SQuAD) and a language-grounding benchmark (MSCOCO). The experiments show that GradObstinate generates more powerful obstinate adversarial examples, achieving a higher attack success rate than antonym-based methods. Furthermore, to show the transferability of the obstinate word substitutions found by GradObstinate, we replace the words in four representative NLP benchmarks with their obstinate substitutions. Notably, obstinate substitutions exhibit a high success rate when transferred to other models in black-box settings, including even GPT-3 and ChatGPT. Obstinate adversarial examples found by GradObstinate are available at https://huggingface.co/spaces/anonauthors/SecretLanguage.
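
To make the notion of an obstinate substitution and its attack success rate concrete, the following is a minimal illustrative sketch, not the GradObstinate procedure itself. It assumes that each pair has already been judged to differ in meaning, and it counts a substitution as successful when the model's prediction stays the same. The classifier `toy_sentiment` and the helpers `is_obstinate` and `attack_success_rate` are hypothetical names introduced only for this illustration.

```python
from typing import Callable, List, Tuple


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in classifier (an assumption, not a model from the paper).

    It keys on a single word, so it is over-stable by construction.
    """
    return "positive" if "good" in text.lower().split() else "negative"


def is_obstinate(model: Callable[[str], str], original: str, substituted: str) -> bool:
    """True if the text was changed but the prediction was not.

    Whether the change is meaningful cannot be checked here; the caller is
    assumed to supply pairs whose meaning really differs.
    """
    return original != substituted and model(original) == model(substituted)


def attack_success_rate(
    model: Callable[[str], str], pairs: List[Tuple[str, str]]
) -> float:
    """Fraction of (original, substituted) pairs that are obstinate."""
    if not pairs:
        return 0.0
    hits = sum(is_obstinate(model, orig, sub) for orig, sub in pairs)
    return hits / len(pairs)


pairs = [
    ("this is truly good", "this is hardly good"),  # meaning flips, prediction stays
    ("a good film", "a bad film"),                  # prediction flips with the meaning
    ("it is always good", "it is never good"),      # meaning flips, prediction stays
    ("the cast is good", "the cast is good"),       # no substitution at all
]
rate = attack_success_rate(toy_sentiment, pairs)
print(f"attack success rate: {rate:.2f}")
```

In this toy setting the substitutions "truly" to "hardly" and "always" to "never" reverse the sentiment of the sentence, yet the keyword-based classifier keeps predicting the same label, which is the over-stability behaviour the abstract describes.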
