Although large language models (LLMs) have achieved great success in a wide range of real-world applications, their vulnerability to noisy inputs has significantly limited their use, especially in high-stakes environments. In these contexts, it is crucial that every prediction made by an LLM is stable, i.e., predictions should remain consistent under minor changes to the input. This falls largely within the study of certifiably robust LLMs, in which every prediction of an LLM is certified to be correct within a local region around the input. Randomized smoothing has demonstrated great potential for certifying the robustness and prediction stability of LLMs. However, randomized smoothing requires adding noise to the input before the model makes its prediction, so its certification performance depends heavily on how well the model performs on corrupted data. As a result, applying it directly to LLMs remains challenging and often yields a small certification radius. To address this issue, we take advantage of the multitasking nature of LLMs and propose to have the LLM denoise the corrupted inputs itself, in a self-denoising manner. Unlike previous approaches such as denoised smoothing, which require training a separate model to robustify the LLM, our method is considerably more efficient and flexible. Our experimental results show that our method outperforms existing certification methods in both certified robustness and empirical robustness. The code is available at https://github.com/UCSB-NLP-Chang/SelfDenoise.
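The smoothing-plus-denoising pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes that the noise is random word masking, and every name in it (`MASK`, `mask_words`, `smoothed_predict`, `toy_classify`, `toy_denoise`) is hypothetical. The classifier and denoiser are toy stand-ins; in the setting of the abstract both roles would presumably be played by the same LLM. The certification step, which turns the vote fraction into a certified radius, is omitted.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

MASK = "<mask>"  # hypothetical mask token, chosen for illustration only


def mask_words(words: List[str], mask_rate: float, rng: random.Random) -> List[str]:
    """Replace a random subset of the words with MASK (assumed noise model)."""
    n_mask = int(round(mask_rate * len(words)))
    masked = set(rng.sample(range(len(words)), n_mask))
    return [MASK if i in masked else w for i, w in enumerate(words)]


def smoothed_predict(
    text: str,
    classify: Callable[[str], str],
    denoise: Callable[[List[str]], List[str]],
    mask_rate: float = 0.25,
    n_samples: int = 200,
    seed: int = 0,
) -> Tuple[str, float]:
    """Majority vote over noisy copies of the input, each denoised before
    classification. Returns the winning label and its share of the votes."""
    rng = random.Random(seed)
    words = text.split()
    votes: Counter = Counter()
    for _ in range(n_samples):
        noisy = mask_words(words, mask_rate, rng)
        cleaned = denoise(noisy)  # self-denoising step: an LLM call in practice
        votes[classify(" ".join(cleaned))] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples


# Toy stand-ins (assumptions): a keyword classifier and a denoiser that
# simply drops the mask tokens instead of asking a model to fill them in.
def toy_classify(sentence: str) -> str:
    return "positive" if "great" in sentence.split() else "negative"


def toy_denoise(words: List[str]) -> List[str]:
    return [w for w in words if w != MASK]


label, vote_share = smoothed_predict("the movie was great", toy_classify, toy_denoise)
print(label, vote_share)
```

Passing the classifier and denoiser as callables keeps the sketch independent of any particular model API. A higher vote share for the winning label would generally translate into a larger certified radius, which is why improving accuracy on corrupted inputs matters for certification.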