Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks

Language models, especially basic text classification models, have been shown to be susceptible to textual adversarial attacks such as synonym substitution and word insertion. To defend against such attacks, a growing body of research has been devoted to improving model robustness. However, providing provable robustness guarantees, rather than empirical robustness, remains largely unexplored. In this paper, we propose Text-CRS, a generalized certified robustness framework for natural language processing (NLP) based on randomized smoothing. To the best of our knowledge, existing certified schemes for NLP can only certify robustness against $\ell_0$ perturbations in synonym substitution attacks. By representing each word-level adversarial operation (i.e., synonym substitution, word reordering, insertion, and deletion) as a combination of a permutation and an embedding transformation, we propose novel smoothing theorems that derive robustness bounds in both the permutation space and the embedding space against such operations. To further improve certified accuracy and radius, we consider the numerical relationships between discrete words and select appropriate noise distributions for randomized smoothing. Finally, we conduct extensive experiments on multiple language models and datasets. Text-CRS can address all four word-level adversarial operations and achieves a significant improvement in accuracy. In addition to outperforming the state-of-the-art certification against synonym substitution attacks, we provide the first benchmark on certified accuracy and radius for the four word-level operations.
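To make the "permutation plus embedding transformation" view concrete, the following is a minimal sketch and not the implementation used in Text-CRS. It assumes a fixed-length, padded input, a toy vocabulary with random embeddings, and two hypothetical helpers (`apply_op`, `as_matrix_form`). Treating insertion and deletion as a swap with a padding slot followed by a reordering is our reading of the abstract, not something it states.

```python
import numpy as np

# Toy vocabulary and embeddings (hypothetical values, for illustration only).
VOCAB = ["<pad>", "the", "movie", "film", "was", "good", "very"]
rng = np.random.default_rng(0)
EMB = rng.normal(size=(len(VOCAB), 4))
EMB[0] = 0.0  # assumption: the padding token embeds to the zero vector


def embed(tokens):
    """Stack the embedding of each token into an (n, d) matrix."""
    return np.stack([EMB[VOCAB.index(t)] for t in tokens])


def apply_op(tokens, perm, replace=None):
    """Apply an embedding-level change, then a permutation of positions.

    replace: dict mapping a source position to its new token.
    perm[i]: index of the source position that ends up at output position i.
    """
    tokens = list(tokens)
    for pos, tok in (replace or {}).items():
        tokens[pos] = tok
    return [tokens[j] for j in perm]


def as_matrix_form(tokens, perm, replace=None):
    """Same operation written as P @ W', with P a permutation matrix."""
    tokens = list(tokens)
    for pos, tok in (replace or {}).items():
        tokens[pos] = tok
    P = np.eye(len(tokens))[perm]  # row i selects source row perm[i]
    return P @ embed(tokens)


x = ["the", "movie", "was", "good", "<pad>"]

# Synonym substitution: identity permutation, one embedding changes.
substituted = apply_op(x, perm=[0, 1, 2, 3, 4], replace={1: "film"})
# Reordering: embeddings unchanged, positions permuted.
reordered = apply_op(x, perm=[0, 1, 3, 2, 4])
# Insertion: a padding slot becomes the new word and is moved into place.
inserted = apply_op(x, perm=[0, 1, 2, 4, 3], replace={4: "very"})
# Deletion: a word becomes padding and is moved to the end.
deleted = apply_op(x, perm=[0, 1, 3, 4, 2], replace={2: "<pad>"})

for name, out in [("substitution", substituted), ("reordering", reordered),
                  ("insertion", inserted), ("deletion", deleted)]:
    print(f"{name:>12}: {' '.join(out)}")
```

Under these assumptions, all four operations reduce to the same two primitives, which is what would allow separate robustness bounds to be stated for the permutation and for the embedding change.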
