Energy-Driven Steering: Reducing False Refusals in Large Language Models

Safety alignment of large language models (LLMs) faces a key challenge: current alignment techniques often focus only on improving safety against harmful prompts, which makes LLMs over-cautious and prone to refusing benign prompts. A key objective of safety alignment is therefore to enhance safety while simultaneously reducing false refusals. In this paper, we introduce Energy-Driven Steering (EDS), a novel, fine-tuning-free framework designed to resolve this challenge through dynamic, inference-time intervention. We train a lightweight, external Energy-Based Model (EBM) to assign high energy to undesirable states (false refusal or jailbreak) and low energy to desirable ones (helpful response or safe refusal). During inference, the EBM maps the LLM's internal activations to an "energy landscape". We use the gradient of the energy function to dynamically steer the LLM's hidden states toward low-energy regions, correcting the model in real time so that it generates a desirable response without any modification of its weights. This method decouples behavioral control from the model's core knowledge, offering a flexible solution with minimal computational overhead. Extensive experiments across a wide range of models show that our method achieves this objective: it substantially lowers false refusal rates, for example raising compliance on the ORB-H benchmark from 57.3% to 82.6%, while maintaining the baseline safety performance. Our work presents an effective paradigm for building LLMs that achieve both low false refusal rates and high safety.
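
The steering step described in the abstract, moving a hidden state along the negative gradient of a learned energy function, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the abstract does not specify the EBM architecture, the step size, the number of steps, or which layers are steered. `ToyEnergyModel`, `steer`, `step_size`, and `n_steps` are hypothetical names, and the energy model uses random rather than trained weights, so the sketch shows only the mechanics of gradient-based steering, not the authors' implementation.

```python
import numpy as np


class ToyEnergyModel:
    """Hypothetical stand-in for the external EBM.

    A one-hidden-layer MLP mapping a hidden state to a scalar energy.
    Weights are random here; in the paper the EBM is trained so that
    undesirable states receive high energy.
    """

    def __init__(self, dim, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((hidden, dim))
        self.b1 = rng.standard_normal(hidden)
        self.w2 = rng.standard_normal(hidden)

    def energy(self, h):
        # E(h) = w2 . tanh(W1 h + b1)
        return float(self.w2 @ np.tanh(self.W1 @ h + self.b1))

    def grad(self, h):
        # Analytic gradient: dE/dh = W1^T (w2 * (1 - tanh^2(W1 h + b1)))
        a = np.tanh(self.W1 @ h + self.b1)
        return self.W1.T @ (self.w2 * (1.0 - a ** 2))


def steer(h, ebm, step_size=0.01, n_steps=5):
    """Move h toward lower energy with a few plain gradient-descent steps.

    step_size and n_steps are illustrative values, not taken from the paper.
    """
    h = np.array(h, dtype=float)  # copy so the caller's array is untouched
    for _ in range(n_steps):
        h = h - step_size * ebm.grad(h)
    return h


if __name__ == "__main__":
    ebm = ToyEnergyModel(dim=4)
    h0 = np.ones(4)  # placeholder for an LLM hidden state
    h1 = steer(h0, ebm)
    print("energy before:", ebm.energy(h0))
    print("energy after: ", ebm.energy(h1))
```

In this sketch the LLM's weights never appear: only the activation vector is changed, which is the sense in which such an intervention leaves the model itself unmodified.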

