When Autoregressive Consistency Hurts Safety Alignment

Safety alignment in large language models (LLMs) is fragile in part because it is often shallow: fine-tuning mainly reshapes the model's behavior near the first few output tokens. We argue that this phenomenon can be understood through autoregressive consistency, the tendency of next-token prediction to preserve and extend the current response trajectory consistently. By analyzing the learning dynamics of safety alignment, we show that autoregressive consistency can concentrate alignment updates on early tokens, offering a mechanistic explanation for shallow safety alignment. The same mechanism also predicts a broader class of attacks on LLMs: attacks that induce harmful continuation states at arbitrary positions in the output trajectory. As a concrete example, we introduce random insertion attack, which inserts a short harmful span into an otherwise safe refusal trajectory and exploits autoregressive consistency to sustain the resulting harmful branch, thereby bypassing safety alignment. Notably, a short harmful span can redirect the generation to be harmful even after a long refusal prefix, highlighting autoregressive consistency as a potential broader failure mechanism. This suggests that safety alignment should also break harmful autoregressive consistency throughout the output trajectory. We therefore propose adversarial safety alignment, an initial framework based on worst-case harmful continuation states, and instantiate it with random worst-insertion training. Overall, our results suggest that autoregressive consistency should be treated as a central consideration in both safety alignment and attack design.

翻译：大型语言模型（LLMs）的安全对齐之所以脆弱，部分原因在于其通常较为浅层：微调主要重塑模型在前几个输出标记附近的行为。我们认为，这一现象可通过自回归一致性（即下一个标记预测倾向于持续一致地保持并延续当前响应轨迹）来理解。通过分析安全对齐的学习动态，我们证明自回归一致性会将对齐更新集中在早期标记上，从而为浅层安全对齐提供了机制性解释。同一机制还预测了针对LLMs的更广泛攻击类型：即能在输出轨迹任意位置诱导有害延续状态的攻击。作为具体示例，我们提出随机插入攻击——在原本安全的拒绝轨迹中插入一个短的有害片段，利用自回归一致性维持由此产生的有害分支，从而绕过安全对齐。值得注意的是，即使存在较长的拒绝前缀，一个短的有害片段仍能重定向生成过程向有害方向发展，这凸显了自回归一致性可能成为更广泛的失效机制。这表明安全对齐应同时打破输出轨迹中贯穿的有害自回归一致性。因此，我们提出对抗性安全对齐——一个基于最坏情况有害延续状态的初步框架，并通过随机最坏插入训练对其进行了实例化。总体而言，我们的结果表明，自回归一致性应被视为安全对齐与攻击设计中的核心考量因素。