Safety alignment of large language models (LLMs) faces a key challenge: current alignment techniques often focus solely on improving safety against harmful prompts, causing LLMs to become over-cautious and refuse to respond to benign prompts. A key objective of safety alignment is therefore to enhance safety while simultaneously reducing false refusals. In this paper, we introduce Energy-Driven Steering (EDS), a novel, fine-tuning-free framework designed to resolve this challenge through dynamic, inference-time intervention. We train a lightweight, external Energy-Based Model (EBM) to assign high energy to undesirable states (false refusals or jailbreaks) and low energy to desirable ones (helpful responses or safe rejections). During inference, the EBM maps the LLM's internal activations to an "energy landscape". We use the gradient of the energy function to dynamically steer the LLM's hidden states toward low-energy regions, correcting the model to generate a desirable response in real time without modifying its weights. This method decouples behavioral control from the model's core knowledge, offering a flexible solution with minimal computational overhead. Extensive experiments across a wide range of models show that our method achieves this objective: it substantially lowers false refusal rates, for example raising compliance on the ORB-H benchmark from 57.3% to 82.6%, while maintaining baseline safety performance. Our work presents an effective paradigm for building LLMs that achieve both low false refusal rates and high safety.
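To make the steering mechanism concrete, the sketch below illustrates the core idea under assumed details: a small MLP energy head (`EnergyModel`) scores a single layer's hidden state, and a few gradient-descent steps (`steer`, with step size `alpha`) nudge that state toward a lower-energy region before decoding continues. The architecture, intervention layer, and hyperparameters are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch of energy-gradient steering of an LLM hidden state (PyTorch).
# EBM architecture, layer choice, step size, and number of steps are assumptions.
import torch
import torch.nn as nn


class EnergyModel(nn.Module):
    """Small MLP mapping an LLM hidden state to a scalar energy."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.SiLU(),
            nn.Linear(256, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h).squeeze(-1)


def steer(h: torch.Tensor, ebm: EnergyModel,
          alpha: float = 0.1, steps: int = 3) -> torch.Tensor:
    """Move hidden state h toward low energy via gradient descent on the EBM."""
    h = h.detach().clone()
    for _ in range(steps):
        h.requires_grad_(True)
        energy = ebm(h).sum()
        grad, = torch.autograd.grad(energy, h)  # dE/dh
        h = (h - alpha * grad).detach()         # descend the energy landscape
    return h


# Example: steer one token's hidden state from a hypothetical 4096-dim layer.
ebm = EnergyModel(hidden_dim=4096)
hidden = torch.randn(1, 4096)
steered = steer(hidden, ebm)
```

In this sketch the LLM's weights are untouched; only the intermediate activation is adjusted at inference time, which is what keeps the behavioral control decoupled from the model's core knowledge.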