What makes a difference in the post-training of LLMs? We investigate the training patterns of different layers in large language models (LLMs) through the lens of gradients, when training with different responses and initial models. We are specifically interested in how fast vs. slow thinking affects the layer-wise gradients, given the recent popularity of training LLMs on reasoning paths such as chain-of-thought (CoT) and process rewards. In our study, fast thinking without CoT leads to larger gradients and larger differences of gradients across layers than slow thinking (detailed CoT), indicating the greater learning stability brought by the latter. Moreover, pre-trained LLMs are less affected by the instability of fast thinking than instruction-tuned LLMs. Additionally, we study whether the gradient patterns can reflect the correctness of responses when training different LLMs on slow vs. fast thinking paths. The results show that the gradients of slow thinking can distinguish correct from irrelevant reasoning paths. As a comparison, we conduct similar gradient analyses on non-reasoning knowledge learning tasks, on which, however, trivially increasing the response length does not lead to behaviors similar to those of slow thinking. Our study strengthens the fundamental understanding of LLM training and offers novel insights into its efficiency and stability, paving the way toward building a generalizable System-2 agent. Our code, data, and gradient statistics are available at: https://github.com/MingLiiii/Layer_Gradient.
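To make the layer-wise gradient analysis concrete, the following is a minimal NumPy sketch of collecting per-layer gradient statistics (Frobenius norms here) from a toy two-layer linear network with MSE loss. The network, layer names, and the choice of norm are illustrative assumptions for exposition; the paper's study concerns transformer LLMs, and its actual metrics may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a 2-layer linear network with MSE loss,
# used only to illustrate gathering per-layer gradient statistics.
x = rng.normal(size=(8, 4))        # batch of 8 inputs
y = rng.normal(size=(8, 2))        # regression targets
W1 = rng.normal(size=(4, 6)) * 0.1  # layer-1 weights
W2 = rng.normal(size=(6, 2)) * 0.1  # layer-2 weights

h = x @ W1                          # layer-1 activations
pred = h @ W2                       # network output
err = pred - y                      # dLoss/dpred for MSE (up to a constant)

# Backpropagate the error to get each layer's weight gradient.
grad_W2 = h.T @ err / len(x)
grad_W1 = x.T @ (err @ W2.T) / len(x)

# Per-layer gradient statistic: the Frobenius norm of each gradient,
# which lets us compare gradient magnitude across layers.
stats = {name: float(np.linalg.norm(g))
         for name, g in [("layer1", grad_W1), ("layer2", grad_W2)]}
print(stats)
```

In an actual LLM training run, the same pattern applies: after `loss.backward()`, iterate over named parameters and record a norm of each layer's gradient, then compare these statistics across layers and across fast- vs. slow-thinking training data.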