Large language models (LLMs) may exhibit undesirable behaviors. Recent efforts have focused on aligning these models to prevent harmful generation. Despite these efforts, studies have shown that even a well-conducted alignment process can easily be circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are those effects merely superficial? In this work, we answer this question through both theoretical and empirical means. Empirically, we demonstrate the elasticity of post-alignment models, i.e., their tendency to revert to the behavior distribution formed during pre-training upon further fine-tuning. Using compression theory, we formally derive that such a fine-tuning process disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We conduct experimental validation confirming the presence of elasticity across models of varying types and sizes. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. We further reveal that elasticity positively correlates with model size and the scale of pre-training data. Our findings underscore the importance of taming the inherent elasticity of LLMs, thereby overcoming their resistance to alignment fine-tuning.