RLHF-style alignment trains language models to refuse unsafe requests, but how much operational margin does this refusal rest on? We introduce the refusal-affirmation logit gap: the difference between the top refusal-token logit and the top affirmative-token logit at the first decoding step. This single scalar quantifies the per-prompt safety margin that alignment provides. Empirically, alignment widens the gap on 97.5--99.8\% of toxic prompts across three model families, and median gap closure co-varies with the True ASR ranking across suffix strategies (an internal consistency check, since our method optimises gap closure). To validate the metric's practical significance, we present logit-gap steering, a gradient-free, forward-pass-only method that discovers short in-distribution suffixes ($<$10 tokens per component) whose cumulative effect closes the gap. The method requires ${\approx}26{,}000$ forward-pass equivalents per family (${\approx}2$~min on one A100), ${\approx}125\times$ fewer than a single GCG search. Suffixes discovered on 0.5B--2B models transfer without modification to 72B models within the same family. An 8-suffix ensemble reaches 38--96\% True ASR across 13 models on AdvBench and HarmBench, and most suffixes have $10^{3}$--$10^{4}\times$ lower perplexity than GCG suffixes. As a result, published perplexity-filter defenses that collapse GCG (64.7\%$\to$1.0\%) leave our suffixes nearly intact (76.9\%$\to$76.0\%). These results demonstrate that current alignment margins, while consistently present, can be thin and efficiently measurable, and that defense strategies must account for in-distribution suffixes.
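As an illustration of the refusal-affirmation logit gap defined above, the following minimal sketch computes the gap from a vector of first-step logits. The token-id sets, the toy logit values, and the function name `logit_gap` are hypothetical and chosen only for this example. In practice the logits would come from a model's forward pass on the prompt, and the refusal and affirmative token sets would have to be specified for each tokenizer.

```python
from typing import Iterable, Sequence


def logit_gap(logits: Sequence[float],
              refusal_ids: Iterable[int],
              affirm_ids: Iterable[int]) -> float:
    """Top refusal-token logit minus top affirmative-token logit."""
    top_refusal = max(logits[i] for i in refusal_ids)
    top_affirm = max(logits[i] for i in affirm_ids)
    return top_refusal - top_affirm


# Hypothetical token ids over a toy 6-token vocabulary (assumptions).
REFUSAL_IDS = [1, 5]   # stand-ins for refusal openers
AFFIRM_IDS = [2, 3]    # stand-ins for affirmative openers

# Made-up first-step logits for a prompt without a suffix.
base_logits = [0.5, 4.0, 1.5, 2.5, -1.0, 3.0]
# Made-up first-step logits for the same prompt with a suffix appended.
suffix_logits = [0.5, 2.0, 1.5, 3.5, -1.0, 1.0]

base_gap = logit_gap(base_logits, REFUSAL_IDS, AFFIRM_IDS)
suffix_gap = logit_gap(suffix_logits, REFUSAL_IDS, AFFIRM_IDS)

print(f"gap without suffix: {base_gap:+.1f}")
print(f"gap with suffix:    {suffix_gap:+.1f}")
```

A positive gap means a refusal token is the most likely first token among the two sets. A gap at or below zero means an affirmative token has overtaken it, which is the sense in which a suffix is said to close the gap.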