Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent work has concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., their tendency to revert to the behavior distribution formed during pre-training upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity correlates positively with model size and the volume of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs in order to mitigate their resistance to alignment.