Recent studies on adversarial examples expose vulnerabilities of natural language processing (NLP) models. Existing techniques for generating adversarial examples are typically driven by deterministic hierarchical rules that are agnostic to what makes an adversarial example optimal, so the resulting samples often strike a suboptimal balance between the magnitude of changes and attack success. To address this, we propose two algorithms, Reversible Jump Attack (RJA) and Metropolis-Hastings Modification Reduction (MMR), which generate highly effective adversarial examples and improve their imperceptibility, respectively. RJA uses a novel randomization mechanism to enlarge the search space and efficiently adapts the number of perturbed words in each adversarial example. Given these generated adversarial examples, MMR applies a Metropolis-Hastings sampler to enhance their imperceptibility. Extensive experiments demonstrate that RJA-MMR outperforms current state-of-the-art methods in attack performance, imperceptibility, fluency, and grammatical correctness.
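To make the idea of using a Metropolis-Hastings sampler to reduce modifications more concrete, the following is a minimal toy sketch, not the authors' MMR algorithm. It assumes a hypothetical state (a 0/1 flag per candidate word substitution), a hypothetical target density proportional to exp(-lam * number of kept substitutions) whenever the attack still succeeds and zero otherwise, and a symmetric single-flip proposal. The names `mh_reduce`, `toy_attack_succeeds`, and `lam` are illustrative assumptions and do not come from the paper.

```python
import math
import random


def mh_reduce(n_words, attack_succeeds, lam=1.0, steps=2000, seed=0):
    """Toy Metropolis-Hastings search over which substitutions to keep.

    State: a tuple of 0/1 flags, one per candidate substitution.
    Assumed target: pi(m) is proportional to exp(-lam * sum(m)) if the
    attack still succeeds under m, and 0 otherwise.
    """
    rng = random.Random(seed)
    state = tuple([1] * n_words)  # start from the fully perturbed example
    if not attack_succeeds(state):
        raise ValueError("initial state must be a successful attack")
    best = state
    for _ in range(steps):
        # Symmetric proposal: flip one randomly chosen flag.
        i = rng.randrange(n_words)
        proposal = state[:i] + (1 - state[i],) + state[i + 1:]
        if not attack_succeeds(proposal):
            continue  # target density is 0 here, so the move is rejected
        # With a symmetric proposal the acceptance ratio is pi(new) / pi(old).
        log_ratio = -lam * (sum(proposal) - sum(state))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            state = proposal
            if sum(state) < sum(best):
                best = state
    return best


def toy_attack_succeeds(mask):
    # Hypothetical victim: the attack works only if substitutions 1 and 3 are kept.
    return mask[1] == 1 and mask[3] == 1


kept = mh_reduce(5, toy_attack_succeeds)
print(kept)
```

In this sketch, removing a substitution is always accepted when the attack still succeeds, while restoring one is accepted only with probability exp(-lam), so the chain drifts toward examples with fewer modified words. A real system would replace `toy_attack_succeeds` with queries to the victim classifier and would likely use a richer target that also scores fluency or semantic similarity.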