Logit-Gap Steering: A Forward-Pass Diagnostic for Alignment Robustness

RLHF-style alignment trains language models to refuse unsafe requests, but how much operational margin does this refusal rest on? We introduce the refusal-affirmation logit gap: the difference between the top refusal-token logit and the top affirmative-token logit at the first decoding step. This single scalar quantifies the per-prompt safety margin that alignment provides. Empirically, alignment widens the gap on 97.5--99.8\% of toxic prompts across three model families, and median gap closure co-varies with True-ASR ranking across suffix strategies (an internal consistency check, since our method optimises gap closure). To validate the metric's practical significance, we present logit-gap steering, a gradient-free, forward-pass-only method that discovers short in-distribution suffixes ($<$10 tokens per component) whose cumulative effect closes the gap. The method requires ${\approx}26{,}000$ forward-pass equivalents per family (${\approx}2$~min on one A100), ${\approx}125\times$ fewer than a single GCG search. Suffixes discovered on 0.5B--2B models transfer without modification to 72B models within the same family. An 8-suffix ensemble reaches 38--96\% True ASR across 13 models on AdvBench and HarmBench, and most suffixes have $10^{3}$--$10^{4}\times$ lower perplexity than GCG suffixes. As a result, published perplexity-filter defenses that collapse GCG (64.7\%$\to$1.0\%) leave our suffixes nearly intact (76.9\%$\to$76.0\%). These results demonstrate that current alignment margins, while consistently present, can be thin and efficiently measurable, and that defense strategies must account for in-distribution suffixes.
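The gap defined above can be illustrated with a minimal sketch. It assumes the first-step logits are already available as a plain list indexed by token id, and that the sets of refusal and affirmative token ids are given; the function name `logit_gap`, the toy logits, and the id sets below are hypothetical and are not taken from the paper, which does not specify here how those token sets are chosen.

```python
from typing import Sequence


def logit_gap(
    logits: Sequence[float],
    refusal_ids: Sequence[int],
    affirm_ids: Sequence[int],
) -> float:
    """Top refusal-token logit minus top affirmative-token logit.

    `logits` holds the model's scores for the first decoding step,
    indexed by token id. A positive result means refusal is favoured.
    """
    top_refusal = max(logits[i] for i in refusal_ids)
    top_affirm = max(logits[i] for i in affirm_ids)
    return top_refusal - top_affirm


# Toy vocabulary of five tokens (hypothetical values, not real model output).
first_step_logits = [0.5, 4.0, 3.0, -1.0, 1.5]
REFUSAL_IDS = [1, 2]   # assumed ids of refusal-style first tokens
AFFIRM_IDS = [4]       # assumed ids of affirmative-style first tokens

gap = logit_gap(first_step_logits, REFUSAL_IDS, AFFIRM_IDS)
print(f"refusal-affirmation gap: {gap}")
```

Under this reading, a suffix "closes the gap" when appending it to the prompt drives the value to zero or below, so that an affirmative token becomes at least as likely as a refusal token at the first step.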
