Transformer-based language models have achieved state-of-the-art performance in natural language generation (NLG), yet their internal mechanisms for synthesizing task-relevant information remain insufficiently understood. While prior studies suggest that intermediate layers often yield more generalizable representations than final layers, how this generalization ability emerges and propagates across layers during training remains unclear. To address this gap, we propose InfoRidge, an information-theoretic framework that characterizes how predictive information (the mutual information between hidden representations and target outputs) varies across depth during training. Our experiments across various models and datasets reveal a consistent non-monotonic trend: predictive information peaks in intermediate layers, forming a generalization ridge, before declining in final layers, reflecting a transition between generalization and memorization. To investigate this phenomenon further, we conduct a set of complementary analyses that leverage residual scaling, attention patterns, and controlled model capacity to characterize layer-wise functional specialization. We further validate our findings with multi-token generation experiments, verifying that the observed ridge persists across decoding steps. Together, these findings offer new insights into the internal mechanisms of transformers and underscore the critical role of intermediate layers in supporting generalization.
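In practice, the layer-wise predictive information I(Z; Y) is often estimated via a variational lower bound, I(Z; Y) >= H(Y) - CE(probe), by training a small probe on each layer's hidden states. The following is a minimal sketch under stated assumptions, not the paper's implementation: the linear softmax probe, the synthetic representations, and the signal-to-noise depth profile are all illustrative stand-ins for real per-layer hidden states.

```python
import numpy as np

def probe_mi_lower_bound(Z, y, n_classes, steps=500, lr=0.5):
    """Variational lower bound on I(Z; Y) in nats:
    I(Z; Y) >= H(Y) - CE, where CE is the cross-entropy of a
    linear softmax probe trained on representations Z with labels y."""
    n, d = Z.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(steps):
        logits = Z @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        g = (p - Y) / n                  # gradient of mean cross-entropy
        W -= lr * Z.T @ g
        b -= lr * g.sum(axis=0)
    ce = -np.mean(np.log(p[np.arange(n), y] + 1e-12))
    prior = np.bincount(y, minlength=n_classes) / n
    h_y = -np.sum(prior * np.log(prior + 1e-12))  # empirical label entropy
    return h_y - ce

# Toy demo: fabricate per-"layer" representations whose class signal
# first strengthens then weakens with depth, mimicking a ridge.
rng = np.random.default_rng(0)
n, d, k = 600, 16, 4
y = rng.integers(0, k, n)
centers = rng.normal(size=(k, d))
snr_by_layer = [0.3, 1.0, 2.5, 1.2, 0.5]  # hypothetical depth profile
mi = [probe_mi_lower_bound(s * centers[y] + rng.normal(size=(n, d)), y, k)
      for s in snr_by_layer]
print([round(v, 3) for v in mi])  # expect the peak at the middle "layer"
```

Because the bound is monotone in how well the probe predicts y, the estimated predictive information tracks the injected signal strength and peaks at the middle entry, the qualitative shape the abstract calls a generalization ridge.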