Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent work has concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., their tendency to revert to the behavior distribution formed during pre-training upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity correlates positively with model size and the volume of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs in order to mitigate their resistance to alignment.