As large language models (LLMs) become integral to various applications, ensuring both their safety and utility is paramount. Jailbreak attacks, which manipulate LLMs into generating harmful content, pose significant challenges to this balance. Existing defenses, such as prompt engineering and safety fine-tuning, often introduce computational overhead, increase inference latency, and lack runtime flexibility. Moreover, overly restrictive safety measures can degrade model utility by causing refusals of benign queries. In this paper, we introduce Jailbreak Antidote, a method that enables real-time adjustment of LLM safety preferences by manipulating a sparse subset of the model's internal states during inference. By shifting the model's hidden representations along a safety direction with varying strengths, we achieve flexible control over the safety-utility balance without additional token overhead or inference delays. Our analysis reveals that safety-related information in LLMs is sparsely distributed; adjusting approximately 5% of the internal state is as effective as modifying the entire state. Extensive experiments on nine LLMs (ranging from 2 billion to 72 billion parameters), evaluated against ten jailbreak attack methods and compared with six defense strategies, validate the effectiveness and efficiency of our approach. By directly manipulating internal states during inference, Jailbreak Antidote offers a lightweight, scalable solution that enhances LLM safety while preserving utility, opening new possibilities for real-time safety mechanisms in widely deployed AI systems.
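The core intervention described above, shifting hidden representations along a safety direction while touching only a sparse (~5%) subset of dimensions, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `sparse_safety_shift`, the top-magnitude dimension selection, and the assumption that a safety direction vector has already been estimated (e.g., by contrasting hidden states on harmful versus benign prompts) are all illustrative choices.

```python
import numpy as np

def sparse_safety_shift(hidden, direction, strength=1.0, sparsity=0.05):
    """Shift a hidden state along a (precomputed) safety direction.

    Only the top-`sparsity` fraction of dimensions, ranked by the
    direction's magnitude, are modified -- a sparse additive steering
    of the kind the paper describes. Illustrative sketch only.
    """
    d = direction / np.linalg.norm(direction)   # unit safety direction
    k = max(1, int(sparsity * d.size))          # ~5% of dimensions
    mask = np.zeros_like(d)
    top = np.argsort(np.abs(d))[-k:]            # largest-magnitude dims
    mask[top] = 1.0
    # positive strength pushes toward safety; negative relaxes it
    return hidden + strength * mask * d
```

The `strength` scalar is the runtime knob: it can be tuned per deployment (or per request) to trade safety against utility without retraining or adding tokens to the prompt.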