Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-based jailbreaking methods circumvent these safety mechanisms by applying orthogonal projections to remove refusal directions, but these approaches treat refusal as a one-dimensional phenomenon and ignore the rich distributional structure of model activations. We introduce a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones. By combining PCA with closed-form Gaussian optimal transport, we achieve efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Across six models (Llama-2, Llama-3.1, Qwen-2.5; 7B-32B parameters), our method achieves up to 11% higher attack success rates than state-of-the-art baselines while maintaining comparable perplexity, demonstrating superior preservation of model capabilities. Critically, we discover that layer-selective intervention (applying optimal transport to 1-2 carefully chosen layers at approximately 40-60% network depth) substantially outperforms full-network interventions, revealing that refusal mechanisms may be localized rather than distributed. Our analysis provides new insights into the geometric structure of safety representations and suggests that current alignment methods may be vulnerable to distributional attacks beyond simple direction removal.
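To make the pipeline in the abstract concrete, the sketch below illustrates one plausible reading of "PCA combined with closed-form Gaussian optimal transport": fit a low-dimensional PCA subspace on pooled activations, estimate Gaussian statistics of harmful and harmless activations in that subspace, and apply the closed-form Monge map between the two Gaussians. This is a minimal illustration under assumed details; the function names, the choice of `n_components`, and fitting PCA on the pooled data are our assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (assumed details): PCA + closed-form Gaussian OT for
# transporting harmful activations toward the harmless distribution.
import numpy as np
from scipy.linalg import sqrtm
from sklearn.decomposition import PCA

def gaussian_ot_map(mu_s, cov_s, mu_t, cov_t, eps=1e-6):
    """Closed-form Monge map between N(mu_s, cov_s) and N(mu_t, cov_t):
    T(x) = mu_t + A (x - mu_s),
    A = cov_s^{-1/2} (cov_s^{1/2} cov_t cov_s^{1/2})^{1/2} cov_s^{-1/2}."""
    d = cov_s.shape[0]
    cov_s = cov_s + eps * np.eye(d)                 # regularize for numerical stability
    cov_s_half = np.real(sqrtm(cov_s))
    cov_s_half_inv = np.linalg.inv(cov_s_half)
    middle = np.real(sqrtm(cov_s_half @ cov_t @ cov_s_half))
    A = cov_s_half_inv @ middle @ cov_s_half_inv
    return lambda x: mu_t + (x - mu_s) @ A.T

def fit_pca_ot(harmful_acts, harmless_acts, n_components=64):
    """Fit PCA on pooled activations, then a Gaussian OT map in the PCA subspace.
    Returns a callable that transports new harmful activations toward harmless ones."""
    pca = PCA(n_components=n_components).fit(np.vstack([harmful_acts, harmless_acts]))
    zh = pca.transform(harmful_acts)                # harmful activations in subspace
    zb = pca.transform(harmless_acts)               # harmless activations in subspace
    T = gaussian_ot_map(zh.mean(0), np.cov(zh, rowvar=False),
                        zb.mean(0), np.cov(zb, rowvar=False))

    def transport(acts):
        z = pca.transform(acts)
        return pca.inverse_transform(T(z))          # map in subspace, reconstruct in full space
    return transport
```

In this reading, the transport function would be applied to residual-stream activations at the 1-2 selected layers (around 40-60% depth) during generation, leaving all other layers untouched.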