As the development of large language models (LLMs) rapidly advances, securing these models effectively without compromising their utility has become a pivotal area of research. However, current defense strategies against jailbreak attacks (i.e., efforts to bypass security protocols) often suffer from limited adaptability, restricted general capability, and high cost. To address these challenges, we introduce SafeAligner, a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks. We begin by developing two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses. SafeAligner leverages the disparity in security levels between the responses from these models to differentiate between harmful and beneficial tokens, effectively guiding the safety alignment by altering the output token distribution of the target model. Extensive experiments show that SafeAligner can increase the likelihood of beneficial tokens, while reducing the occurrence of harmful ones, thereby ensuring secure alignment with minimal loss to generality.
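The decoding-time mechanism described above can be sketched in a few lines. The sketch below assumes an additive logit correction of the form `target + alpha * (sentinel - intruder)`; the function name `safealigner_step`, the fusion rule, and the weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def safealigner_step(target_logits, sentinel_logits, intruder_logits, alpha=1.0):
    """Shift the target model's next-token distribution toward the
    safety-trained Sentinel model and away from the risk-prone Intruder
    model. The additive correction used here is an assumed form of the
    guidance; the paper's actual fusion rule may differ."""
    adjusted = [
        t + alpha * (s - i)
        for t, s, i in zip(target_logits, sentinel_logits, intruder_logits)
    ]
    return softmax(adjusted)

# Toy vocabulary of three tokens; token index 2 stands for a harmful token.
target   = [1.0, 1.0, 1.0]   # target model is indifferent
sentinel = [1.0, 1.0, -2.0]  # Sentinel downweights the harmful token
intruder = [1.0, 1.0, 2.0]   # Intruder upweights the harmful token
probs = safealigner_step(target, sentinel, intruder, alpha=0.5)
```

Because the Sentinel and Intruder disagree most strongly on the harmful token, the correction suppresses its probability while leaving the benign tokens' relative odds unchanged.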