The field of adversarial textual attacks has grown significantly over the last few years. The commonly considered objective is to craft adversarial examples (AEs) that successfully fool the target model. However, the imperceptibility of attacks, which is also essential for practical attackers, is often left out by previous studies. As a consequence, the crafted AEs tend to have obvious structural and semantic differences from the original human-written text, making them easily perceptible. In this work, we advocate leveraging multi-objectivization to address this issue. Specifically, we reformulate the problem of crafting AEs as a multi-objective optimization problem, in which attack imperceptibility is treated as an auxiliary objective. We then propose a simple yet effective evolutionary algorithm, dubbed HydraText, to solve this problem. To the best of our knowledge, HydraText is currently the only approach that can be effectively applied to both score-based and decision-based attack settings. Extensive experiments involving 44,237 instances demonstrate that HydraText consistently achieves competitive attack success rates and better attack imperceptibility than recently proposed attack approaches. A human evaluation study also shows that the AEs crafted by HydraText are harder to distinguish from human-written text. Finally, these AEs exhibit good transferability and can bring notable robustness improvements to the target model through adversarial training.
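To make the multi-objective reformulation concrete, the following is a minimal sketch of Pareto dominance over two objectives: whether a candidate fools the model, and how heavily it modifies the original text. It is not HydraText itself. The `Candidate` type, the use of a word-change count as a proxy for imperceptibility, and the toy sentences are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical candidate AE; field names are illustrative assumptions."""
    text: str
    fools_model: bool  # primary objective: does the target model mispredict?
    num_changes: int   # assumed imperceptibility proxy: fewer changed words is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.fools_model >= b.fools_model and a.num_changes <= b.num_changes
    strictly_better = a.fools_model > b.fools_model or a.num_changes < b.num_changes
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Toy candidates derived from the original text "a great film".
candidates = [
    Candidate("a great film", False, 0),   # unmodified original
    Candidate("a fine film", True, 1),     # successful, one change
    Candidate("a fine movie", True, 2),    # successful, but more changes
    Candidate("a great movie", False, 1),  # modified yet unsuccessful
]
front = pareto_front(candidates)
```

Under these assumptions, the successful candidate with two changes is dominated by the one with a single change, so treating imperceptibility as a second objective steers the search toward less visible edits.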