Although large language models (LLMs) have achieved significant success, their vulnerability to adversarial perturbations, including recent jailbreak attacks, has raised considerable concerns. The increasing size of these models and the limited access to them make improving their robustness challenging. Among various defense strategies, randomized smoothing has shown great potential for LLMs, as it requires neither full access to the model's parameters nor fine-tuning via adversarial training. However, randomized smoothing adds noise to the input before model prediction, so the robustness of the smoothed model largely depends on the model's performance on noise-corrupted data, which is often sub-optimal. To address this issue, we propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then make predictions based on the denoised versions. We call this procedure self-denoised smoothing. Unlike previous denoised smoothing techniques in computer vision, which require training a separate denoising model, our method uses the LLM itself as the denoiser and therefore offers significantly better efficiency and flexibility. Our experimental results indicate that our method surpasses existing methods in both empirical and certified robustness when defending against adversarial attacks on both downstream tasks and human alignment (i.e., jailbreak attacks). Our code is publicly available at https://github.com/UCSB-NLP-Chang/SelfDenoise.
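The procedure described above (perturb the input, denoise it, predict, then aggregate) can be illustrated with a minimal sketch. It is not the released implementation. It assumes that the noise is random word masking and that aggregation is a majority vote over samples. The names `mask_tokens`, `smoothed_predict`, `denoise`, `classify`, and the `<mask>` placeholder are hypothetical. In practice both callables would prompt the same LLM; here they are toy stand-ins so the example runs on its own.

```python
import random
from collections import Counter
from typing import Callable, List

MASK = "<mask>"  # hypothetical placeholder token, chosen for illustration


def mask_tokens(tokens: List[str], mask_rate: float, rng: random.Random) -> List[str]:
    """Replace each token with MASK independently with probability mask_rate."""
    return [MASK if rng.random() < mask_rate else tok for tok in tokens]


def smoothed_predict(
    text: str,
    denoise: Callable[[str], str],
    classify: Callable[[str], str],
    num_samples: int = 10,
    mask_rate: float = 0.3,
    seed: int = 0,
) -> str:
    """Majority vote over predictions made on denoised copies of noisy inputs."""
    rng = random.Random(seed)
    tokens = text.split()
    votes: Counter = Counter()
    for _ in range(num_samples):
        noisy = " ".join(mask_tokens(tokens, mask_rate, rng))
        restored = denoise(noisy)        # step 1: the model repairs the noisy input
        votes[classify(restored)] += 1   # step 2: the model predicts on the repaired input
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy stand-ins, NOT an LLM: drop the masks, then apply a keyword rule.
    def toy_denoise(noisy: str) -> str:
        return " ".join(tok for tok in noisy.split() if tok != MASK)

    def toy_classify(sentence: str) -> str:
        return "positive" if "great" in sentence else "negative"

    print(smoothed_predict("a truly great movie", toy_denoise, toy_classify))
```

Passing the denoiser and the classifier as callables keeps the smoothing loop independent of any particular model API, which mirrors the point that no access to model parameters is needed.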