Learning dynamics, which describes how the learning of specific training examples influences the model's predictions on other examples, gives us a powerful tool for understanding the behavior of deep learning systems. We study the learning dynamics of large language models during different types of finetuning by analyzing the step-wise decomposition of how influence accumulates among different potential responses. Our framework allows a uniform interpretation of many interesting observations about the training of popular algorithms for both instruction tuning and preference tuning. In particular, we propose a hypothetical explanation of why specific types of hallucination are strengthened after finetuning, e.g., the model might use phrases or facts from the response to question B to answer question A, or the model might keep repeating similar simple phrases when generating responses. We also extend our framework and highlight a unique "squeezing effect" to explain a previously observed phenomenon in off-policy direct preference optimization (DPO), where running DPO for too long makes even the desired outputs less likely. This framework also provides insights into where the benefits of on-policy DPO and other variants come from. The analysis not only provides a novel perspective on understanding LLM finetuning but also inspires a simple, effective method to improve alignment performance.
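As a rough illustration of the per-step influence described above (a minimal sketch, not the paper's method or code), the toy example below stands a linear softmax classifier in for an LLM and measures how a single SGD update on one training pair (x_u, y_u) shifts the log-probability the model assigns to a different example's response (x_o, y_o). Accumulating such per-step shifts over training is the kind of step-wise decomposition the abstract refers to; all variable names here are illustrative.

```python
# Hypothetical sketch: measure how one gradient step on (x_u, y_u) changes
# log p(y_o | x_o) for a *different* example. A toy softmax classifier
# stands in for the LLM; this is not the paper's code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup: 8-dim "prompt" features, 5 candidate "responses" (classes).
model = torch.nn.Linear(8, 5)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x_u, y_u = torch.randn(1, 8), torch.tensor([2])  # pair we train on
x_o, y_o = torch.randn(1, 8), torch.tensor([4])  # pair we only observe

def log_prob(x, y):
    """log p_theta(y | x) under the current parameters."""
    with torch.no_grad():
        return F.log_softmax(model(x), dim=-1)[0, y.item()].item()

before = log_prob(x_o, y_o)

# One SGD step on (x_u, y_u): the "learning" whose influence we track.
opt.zero_grad()
F.cross_entropy(model(x_u), y_u).backward()
opt.step()

after = log_prob(x_o, y_o)

# A positive delta means learning (x_u, y_u) also made y_o more likely
# given x_o; summing these deltas over steps gives the accumulated
# influence of training on other examples' predictions.
print(f"delta log p(y_o | x_o) = {after - before:+.4f}")
```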