Safety alignment of large language models currently faces a central challenge: existing alignment techniques often prioritize suppressing responses to harmful prompts at the cost of overcautious behavior, leading models to incorrectly refuse benign requests. A key goal of safety alignment is therefore to improve safety while simultaneously minimizing false refusals. In this work, we introduce Energy Landscape Steering (ELS), a novel, fine-tuning-free framework designed to resolve this challenge through dynamic, inference-time intervention. We train a lightweight external Energy-Based Model (EBM) to assign high energy to undesirable states (false refusal or jailbreak) and low energy to desirable states (helpful response or safe rejection). During inference, the EBM maps the LLM's internal activations to an energy landscape, and we use the gradient of the energy function to steer the hidden states toward low-energy regions in real time. This dynamically guides the model toward desirable behavior without modifying its parameters. By decoupling behavioral control from the model's core knowledge, ELS provides a flexible and computationally efficient solution. Extensive experiments across diverse models demonstrate its effectiveness, raising compliance on the ORB-H benchmark from 57.3% to 82.6% while maintaining baseline safety performance. Our work establishes a promising paradigm for building LLMs that simultaneously achieve high safety and low false refusal rates.
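The inference-time update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a trained EBM would define the energy E(h) over hidden states, so we substitute a simple quadratic energy whose minimum stands in for a low-energy (desirable) region, and apply the gradient step h ← h − α ∇E(h). The names `energy`, `steer`, and the step size are hypothetical.

```python
import numpy as np

# Illustrative stand-in for a trained EBM: a quadratic energy
# E(h) = 0.5 * ||h - mu||^2 whose minimum mu represents a
# low-energy ("desirable") region of hidden-state space.
def energy(h, mu):
    return 0.5 * np.sum((h - mu) ** 2)

def energy_grad(h, mu):
    # Analytic gradient of the quadratic energy with respect to h.
    return h - mu

def steer(h, mu, step_size=0.1, n_steps=10):
    """Gradient-descent steering of a hidden state: h <- h - alpha * dE/dh.

    In ELS this update would run at inference time on the LLM's
    activations, leaving the model's parameters untouched.
    """
    for _ in range(n_steps):
        h = h - step_size * energy_grad(h, mu)
    return h

hidden = np.ones(4)    # stand-in for an LLM hidden state
target = np.zeros(4)   # center of the assumed low-energy region
steered = steer(hidden, target)
assert energy(steered, target) < energy(hidden, target)
```

The key design point carried over from the abstract is that steering only touches activations: the update is a few cheap gradient steps through a small external network, so behavioral control stays decoupled from the frozen LLM weights.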