Adversarial attacks in Natural Language Processing apply perturbations at the character or token level. Token-level attacks, gaining prominence for their use of gradient-based methods, are susceptible to altering sentence semantics, leading to invalid adversarial examples. While character-level attacks easily maintain semantics, they have received less attention, as they cannot easily adopt popular gradient-based methods and are thought to be easy to defend against. Challenging these beliefs, we introduce Charmer, an efficient query-based adversarial attack capable of achieving a high attack success rate (ASR) while generating highly similar adversarial examples. Our method successfully targets both small (BERT) and large (Llama 2) models. Specifically, on BERT with SST-2, Charmer improves the ASR by 4.84 percentage points and the USE similarity by 8 percentage points with respect to the prior art. Our implementation is available at https://github.com/LIONS-EPFL/Charmer.
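To illustrate what a character-level perturbation looks like (this is a generic sketch of the candidate space such attacks search, not Charmer's actual algorithm), the snippet below enumerates all sentences within Levenshtein distance 1 of an input via single-character insertions and substitutions; a query-based attack would then score each candidate with the victim model and keep the one that most degrades the prediction:

```python
import string


def single_char_perturbations(sentence: str,
                              alphabet: str = string.ascii_lowercase) -> set:
    """Enumerate candidate strings at Levenshtein distance 1 from `sentence`
    (single-character insertions and substitutions). Hypothetical helper for
    illustration; not the authors' implementation."""
    candidates = set()
    for i in range(len(sentence) + 1):
        for c in alphabet:
            # insertion of character c at position i
            candidates.add(sentence[:i] + c + sentence[i:])
            if i < len(sentence) and sentence[i] != c:
                # substitution of the character at position i by c
                candidates.add(sentence[:i] + c + sentence[i + 1:])
    candidates.discard(sentence)  # keep only genuine perturbations
    return candidates


cands = single_char_perturbations("hello")
print(len(cands))  # size of the distance-1 neighborhood
```

Even for a short string the neighborhood is small enough to score exhaustively with model queries, which is why character-level attacks can remain effective without gradient access.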