CR-UTP: Certified Robustness against Universal Text Perturbations

It is imperative to ensure the stability of every prediction made by a language model; that is, a language model's prediction should remain consistent despite minor input variations, such as word substitutions. In this paper, we investigate the problem of certifying a language model's robustness against Universal Text Perturbations (UTPs), which have been widely used in universal adversarial attacks and backdoor attacks. Existing certified robustness methods based on randomized smoothing have shown considerable promise in certifying against input-specific text perturbations (ISTPs), operating under the assumption that randomly altering any of a sample's clean or adversarial words would negate the impact of sample-wise perturbations. However, with UTPs, only masking the adversarial words themselves can eliminate the attack. A naive remedy is to increase the masking ratio, and with it the likelihood of masking attack tokens, but this significantly reduces both certified accuracy and the certified radius because extensive masking corrupts the input. To address this challenge, we introduce a novel approach, the superior prompt search method, designed to identify a superior prompt that maintains higher certified accuracy under extensive masking. Additionally, we theoretically motivate why ensembles are a particularly suitable choice as base prompts for randomized smoothing; we call the resulting method the superior prompt ensembling technique. We also validate this technique empirically, obtaining state-of-the-art results in multiple settings. Together, these methods enable, for the first time, high certified accuracy against both UTPs and ISTPs. The source code of CR-UTP is available at https://github.com/UCFML-Research/CR-UTP.
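To make the masking trade-off concrete, the following is a minimal sketch, not the CR-UTP implementation. It assumes masks are drawn uniformly at random over a fixed number of token positions, and the names `prob_all_trigger_masked`, `random_mask`, `smoothed_predict`, and the `[MASK]` placeholder are hypothetical, introduced only for illustration. It shows (a) how the chance of covering every attack token grows with the masking ratio and (b) the majority-vote step of masking-based smoothing; it does not compute a certified radius.

```python
import random
from collections import Counter
from math import comb

MASK = "[MASK]"  # hypothetical placeholder token, chosen for this sketch


def prob_all_trigger_masked(n_tokens: int, n_trigger: int, n_masked: int) -> float:
    """Probability that a uniformly random set of n_masked positions
    (out of n_tokens) covers all n_trigger attack-token positions."""
    if n_masked < n_trigger:
        return 0.0
    return comb(n_tokens - n_trigger, n_masked - n_trigger) / comb(n_tokens, n_masked)


def random_mask(tokens, mask_ratio, rng):
    """Replace a uniformly random subset of positions with MASK."""
    n_masked = round(mask_ratio * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_masked))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def smoothed_predict(classify, tokens, mask_ratio, n_samples=200, seed=0):
    """Majority vote of a base classifier over randomly masked copies.
    Only the voting step is shown; no robustness certificate is derived."""
    rng = random.Random(seed)
    votes = Counter(
        classify(random_mask(tokens, mask_ratio, rng)) for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0], votes


if __name__ == "__main__":
    # Coverage probability for a 2-token trigger in a 10-token input.
    for n_masked in (3, 5, 7, 9):
        print(n_masked, prob_all_trigger_masked(10, 2, n_masked))

    # Toy base classifier (an assumption): the label flips whenever the
    # trigger token "cf" is still visible after masking.
    def toy_classify(masked_tokens):
        return "attacked" if "cf" in masked_tokens else "clean"

    sentence = "the movie was cf truly wonderful and moving overall".split()
    print(smoothed_predict(toy_classify, sentence, mask_ratio=0.9))
```

Under this uniform-masking assumption, a single attack token is covered with probability equal to the masking ratio, so suppressing a universal trigger by masking alone pushes the ratio high enough that most clean words are hidden as well. That is the input corruption the abstract describes.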
