Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks

Language models, especially basic text classification models, have been shown to be susceptible to textual adversarial attacks such as synonym substitution and word insertion. To defend against such attacks, a growing body of research has been devoted to improving model robustness. However, providing provable robustness guarantees, rather than empirical robustness, remains largely unexplored. In this paper, we propose Text-CRS, a generalized certified robustness framework for natural language processing (NLP) based on randomized smoothing. To the best of our knowledge, existing certified schemes for NLP can only certify robustness against $\ell_0$ perturbations in synonym substitution attacks. By representing each word-level adversarial operation (i.e., synonym substitution, word reordering, insertion, and deletion) as a combination of a permutation and an embedding transformation, we propose novel smoothing theorems that derive robustness bounds in both the permutation space and the embedding space against such operations. To further improve certified accuracy and radius, we consider the numerical relationships between discrete words and select appropriate noise distributions for the randomized smoothing. Finally, we conduct extensive experiments on multiple language models and datasets. Text-CRS can address all four word-level adversarial operations and achieves a significant improvement in accuracy. Besides outperforming the state-of-the-art certification against synonym substitution attacks, we also provide the first benchmark on certified accuracy and radius for the four word-level operations.
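To make the idea of smoothing in both the permutation space and the embedding space concrete, the following is a minimal sketch of a majority-vote smoothed classifier. It is not the Text-CRS implementation: the function names (`smoothed_predict`, `toy_classifier`), the uniform random shuffle of word positions, the Gaussian embedding noise, and all parameter values are assumptions chosen purely for illustration, and the noise distributions actually selected in the paper may differ for each operation.

```python
import numpy as np


def smoothed_predict(base_classifier, embeddings, sigma=0.5,
                     n_samples=200, num_classes=2, seed=0):
    """Majority-vote prediction of a randomized-smoothing classifier.

    Illustrative assumptions: word positions are shuffled uniformly at
    random (permutation-space noise) and i.i.d. Gaussian noise is added
    to every embedding vector (embedding-space noise).
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(n_samples):
        # Permutation-space noise: random reordering of the word positions.
        perm = rng.permutation(embeddings.shape[0])
        # Embedding-space noise: Gaussian perturbation of each embedding.
        noisy = embeddings[perm] + rng.normal(0.0, sigma, size=embeddings.shape)
        counts[base_classifier(noisy)] += 1
    # The smoothed prediction is the class that wins the vote.
    return int(np.argmax(counts)), counts


def toy_classifier(x):
    # Hypothetical base classifier: class 1 if the mean of the first
    # embedding dimension is positive, otherwise class 0.
    return int(x[:, 0].mean() > 0.0)


# A toy "sentence" of 5 words with 4-dimensional embeddings.
sentence = np.full((5, 4), 2.0)
label, counts = smoothed_predict(toy_classifier, sentence)
print(label, counts)
```

In a certified setting, the vote counts would then be turned into a confidence bound on the top-class probability, from which a robustness radius is derived; that step depends on the specific smoothing theorems and is omitted here.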
