Deep learning-based natural language processing (NLP) models, particularly pre-trained language models (PLMs), are known to be vulnerable to adversarial attacks. However, the adversarial examples generated by many mainstream word-level adversarial attack models are neither valid nor natural: they fail to preserve semantics, grammaticality, and human imperceptibility. Leveraging the strong language understanding and generation capabilities of large language models (LLMs), we propose LLM-Attack, which aims to generate adversarial examples that are both valid and natural. The method consists of two stages: word importance ranking, which searches for the most vulnerable words, and word synonym replacement, which substitutes those words with synonyms obtained from LLMs. Experimental results on the Movie Review (MR), IMDB, and Yelp Review Polarity datasets, in comparison with baseline adversarial attack models, demonstrate the effectiveness of LLM-Attack, which outperforms the baselines in human and GPT-4 evaluation by a significant margin. LLM-Attack generates adversarial examples that are typically valid and natural, preserving semantic meaning, grammaticality, and human imperceptibility.
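The two-stage procedure described in the abstract (word importance ranking, then synonym replacement) can be illustrated with a minimal sketch. The abstract does not say how importance is scored or how the LLM is queried, so everything here is an assumption made for illustration: importance is approximated by leave-one-out deletion, the victim model is a toy keyword scorer (`toy_positive_prob`), and the LLM is replaced by a fixed lookup table (`toy_synonyms`). All function and variable names are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical toy victim model: keyword weights for the "positive" class.
POSITIVE_WEIGHTS = {"great": 0.4, "good": 0.3, "fine": 0.05}

# Hypothetical stand-in for LLM-generated synonyms.
TOY_SYNONYMS = {"great": ["fine"], "good": ["decent"]}


def toy_positive_prob(words: Sequence[str]) -> float:
    """Return a toy 'positive' probability, clipped to [0, 1]."""
    score = 0.2 + sum(POSITIVE_WEIGHTS.get(w, 0.0) for w in words)
    return max(0.0, min(1.0, score))


def toy_synonyms(word: str, context: Sequence[str]) -> List[str]:
    """Placeholder for an LLM query; a real system would use the context."""
    return TOY_SYNONYMS.get(word, [])


def rank_words(words: Sequence[str],
               prob_fn: Callable[[Sequence[str]], float]) -> List[int]:
    """Stage 1 (assumed leave-one-out scoring): order word positions by how
    much deleting each word lowers the model's probability."""
    base = prob_fn(words)
    drops = []
    for i in range(len(words)):
        reduced = list(words[:i]) + list(words[i + 1:])
        drops.append((base - prob_fn(reduced), i))
    # Largest drop first; ties broken by position.
    drops.sort(key=lambda item: (-item[0], item[1]))
    return [i for _, i in drops]


def attack(words: Sequence[str],
           prob_fn: Callable[[Sequence[str]], float],
           synonym_fn: Callable[[str, Sequence[str]], List[str]],
           threshold: float = 0.5) -> Optional[List[str]]:
    """Stage 2: greedily replace the most important words with the synonym
    that lowers the probability most; stop once the label flips."""
    current = list(words)
    for i in rank_words(current, prob_fn):
        best, best_p = current, prob_fn(current)
        for candidate in synonym_fn(current[i], current):
            trial = current[:i] + [candidate] + current[i + 1:]
            p = prob_fn(trial)
            if p < best_p:
                best, best_p = trial, p
        current = best
        if best_p < threshold:
            return current
    return None  # no successful adversarial example found


sentence = "a great and good movie".split()
adversarial = attack(sentence, toy_positive_prob, toy_synonyms)
print(adversarial)
```

In a realistic setting, `toy_positive_prob` would wrap the victim PLM's classifier and `toy_synonyms` would wrap an LLM prompt asking for context-appropriate synonyms; the greedy loop itself is only one plausible search strategy.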