Safety alignment in Large Language Models (LLMs) often creates a systematic discrepancy between a model's aligned output and the underlying pre-alignment data distribution. We propose a framework in which the effect of safety alignment on next-token prediction is modeled as a systematic distortion of the pre-alignment distribution. We cast Weak-to-Strong Jailbreaking as a forecast aggregation problem and derive an optimal aggregation strategy characterized by a Gradient Shift in the loss-induced dual space. We show that logit-arithmetic jailbreaking methods are a special case of this framework under cross-entropy loss, and we derive a broader family of aggregation rules corresponding to other proper losses. We also propose a new hybrid aggregation rule. Evaluations on red-teaming benchmarks and math utility tasks with frontier models demonstrate that our approach achieves higher Attack Success Rates and a lower "Jailbreak Tax" than existing methods, especially on the safety-hardened gpt-oss-120b.
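For concreteness, the cross-entropy special case can be read as the familiar logit-arithmetic weak-to-strong rule; the sketch below uses our own illustrative notation ($p_{\mathrm{strong}}$ for the safe strong model, $p_{u}, p_{a}$ for the unsafe and aligned weak models, amplification factor $\alpha$), which may differ from the paper's symbols:
\[
\tilde{p}(x_t \mid x_{<t}) \;\propto\; p_{\mathrm{strong}}(x_t \mid x_{<t})
\left( \frac{p_{u}(x_t \mid x_{<t})}{p_{a}(x_t \mid x_{<t})} \right)^{\alpha},
\qquad\text{equivalently}\qquad
\ell_{\tilde{p}} = \ell_{\mathrm{strong}} + \alpha \left( \ell_{u} - \ell_{a} \right),
\]
where $\ell$ denotes logits. Under cross-entropy (log loss) the dual-space map is $p \mapsto \log p$, so this multiplicative reweighting of probabilities is an additive shift in logit space, which is one plausible reading of the Gradient Shift the abstract describes.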