Large language models (LLMs) may exhibit undesirable behaviors. Recent efforts have focused on aligning these models to prevent harmful generation. Despite these efforts, studies have shown that even a well-conducted alignment process can easily be circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are those effects merely superficial? In this work, we answer this question through both theoretical and empirical means. Empirically, we demonstrate the elasticity of post-alignment models, i.e., their tendency to revert to the behavior distribution formed during pre-training upon further fine-tuning. Using compression theory, we formally derive that such a fine-tuning process disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We conduct experimental validation confirming the presence of elasticity across models of varying types and sizes. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. We further reveal that elasticity positively correlates with model size and the scale of pre-training data. Our findings underscore the importance of taming the inherent elasticity of LLMs, thereby overcoming their resistance to alignment fine-tuning.