As large language models (LLMs) become increasingly integrated into real-world applications such as code generation and chatbot assistance, extensive efforts have been made to align LLM behavior with human values, including safety. Jailbreak attacks, which aim to provoke unintended and unsafe behaviors from LLMs, remain a leading threat to LLM safety. In this paper, we defend LLMs against jailbreak attacks by introducing SafeDecoding, a safety-aware decoding strategy that lets LLMs generate helpful and harmless responses to user queries. SafeDecoding builds on the observation that, even when the probabilities of tokens representing harmful content outweigh those of tokens representing harmless responses, safety disclaimers still appear among the top tokens once tokens are sorted by probability in descending order. This allows us to mitigate jailbreak attacks by identifying safety disclaimers and amplifying their token probabilities, while simultaneously attenuating the probabilities of token sequences that are aligned with the objectives of jailbreak attacks. We perform extensive experiments on five LLMs using six state-of-the-art jailbreak attacks and four benchmark datasets. Our results show that SafeDecoding significantly reduces the attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries. SafeDecoding also outperforms six existing defense methods.
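The reweighting idea described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the function name `reweight_top_tokens`, the hand-labelled `safe_tokens` and `harmful_tokens` sets, and the parameters `k`, `boost`, and `damp` are all assumptions made for illustration. The sketch only shows how a disclaimer token that already sits among the top-k candidates could overtake a harmful continuation once its probability is amplified and the competing probabilities are attenuated.

```python
from typing import Dict, Set


def reweight_top_tokens(
    probs: Dict[str, float],
    safe_tokens: Set[str],
    harmful_tokens: Set[str],
    k: int = 5,
    boost: float = 4.0,
    damp: float = 0.25,
) -> Dict[str, float]:
    """Toy reweighting of a next-token distribution (illustrative only).

    All parameters are hypothetical: `safe_tokens` / `harmful_tokens` are
    hand-labelled here, whereas a real defense would need some way to
    identify them automatically.
    """
    # Sort tokens by probability in descending order and keep the top k.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

    adjusted = {}
    for token, p in top:
        if token in safe_tokens:
            p *= boost   # amplify tokens that start a safety disclaimer
        elif token in harmful_tokens:
            p *= damp    # attenuate tokens aligned with the attack goal
        adjusted[token] = p

    # Renormalize so the result is a valid probability distribution.
    total = sum(adjusted.values())
    return {token: p / total for token, p in adjusted.items()}


# Hypothetical next-token distribution under a jailbreak prompt:
# the compliant opening "Sure" dominates, but "I" (as in "I cannot ...")
# is still among the top candidates.
probs = {"Sure": 0.55, "I": 0.20, "Here": 0.15, "As": 0.07, "The": 0.03}
new_probs = reweight_top_tokens(
    probs, safe_tokens={"I"}, harmful_tokens={"Sure", "Here"}
)
best_before = max(probs, key=probs.get)
best_after = max(new_probs, key=new_probs.get)
print(best_before, "->", best_after)
```

In this toy distribution the most likely token changes from the compliant opening to the disclaimer opening. The multiplicative factors are arbitrary; how strongly to amplify or attenuate, and how to decide which tokens belong to which set, are exactly the design questions a real decoding strategy has to answer.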