As large language models (LLMs) become integral to various applications, ensuring both their safety and utility is paramount. Jailbreak attacks, which manipulate LLMs into generating harmful content, pose significant challenges to this balance. Existing defenses, such as prompt engineering and safety fine-tuning, often introduce computational overhead, increase inference latency, and lack runtime flexibility. Moreover, overly restrictive safety measures can degrade model utility by causing refusals of benign queries. In this paper, we introduce Jailbreak Antidote, a method that enables real-time adjustment of LLM safety preferences by manipulating a sparse subset of the model's internal states during inference. By shifting the model's hidden representations along a safety direction with varying strengths, we achieve flexible control over the safety-utility balance without additional token overhead or inference delays. Our analysis reveals that safety-related information in LLMs is sparsely distributed; adjusting approximately 5% of the internal state is as effective as modifying the entire state. Extensive experiments on nine LLMs (ranging from 2 billion to 72 billion parameters), evaluated against ten jailbreak attack methods and compared with six defense strategies, validate the effectiveness and efficiency of our approach. By directly manipulating internal states during inference, Jailbreak Antidote offers a lightweight, scalable solution that enhances LLM safety while preserving utility, opening new possibilities for real-time safety mechanisms in widely deployed AI systems.
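The core intervention described above, shifting hidden representations along a safety direction while touching only a sparse (~5%) subset of dimensions, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `sparse_safety_shift`, the top-magnitude dimension selection, and the assumption that a safety direction vector has already been estimated (e.g., by contrasting hidden states on harmful versus benign prompts) are all illustrative choices.

```python
import numpy as np

def sparse_safety_shift(hidden, direction, strength=1.0, sparsity=0.05):
    """Shift a hidden state along a (precomputed) safety direction.

    Only the top-`sparsity` fraction of dimensions, ranked by the
    direction's magnitude, are modified -- a sparse additive steering
    of the kind the paper describes. Illustrative sketch only.
    """
    d = direction / np.linalg.norm(direction)   # unit safety direction
    k = max(1, int(sparsity * d.size))          # ~5% of dimensions
    mask = np.zeros_like(d)
    top = np.argsort(np.abs(d))[-k:]            # largest-magnitude dims
    mask[top] = 1.0
    # positive strength pushes toward safety; negative relaxes it
    return hidden + strength * mask * d
```

The `strength` scalar is the runtime knob: it can be tuned per deployment (or per request) to trade safety against utility without retraining or adding tokens to the prompt.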