Transformer-based language models have achieved state-of-the-art performance in natural language generation (NLG), yet their internal mechanisms for synthesizing task-relevant information remain insufficiently understood. While prior studies suggest that intermediate layers often yield more generalizable representations than final layers, how this generalization ability emerges and propagates across layers during training remains unclear. To address this gap, we propose InfoRidge, an information-theoretic framework that characterizes how predictive information (the mutual information between hidden representations and target outputs) varies across depth during training. Our experiments across various models and datasets reveal a consistent non-monotonic trend: predictive information peaks in intermediate layers, forming a generalization ridge, before declining in final layers, reflecting a transition between generalization and memorization. To investigate this phenomenon further, we conduct a set of complementary analyses that leverage residual scaling, attention patterns, and controlled model capacity to characterize layer-wise functional specialization. We also validate our findings with multi-token generation experiments, verifying that the observed ridge persists across decoding steps. Together, these findings offer new insights into the internal mechanisms of transformers and underscore the critical role of intermediate layers in supporting generalization.
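To make the central quantity concrete: the abstract defines predictive information as the mutual information I(Z_l; Y) between a layer's hidden representation Z_l and the target output Y. The snippet below is a minimal, hypothetical sketch (not the authors' InfoRidge implementation) of a plug-in MI estimate over discretized representations, illustrating how one could compare a mid-layer code that tracks the target against a final-layer code that partially collapses it.

```python
# Hypothetical sketch: plug-in estimate of I(Z; Y) in nats from paired
# samples of discrete variables, via empirical joint and marginal counts.
from collections import Counter
import math

def mutual_information(zs, ys):
    """Plug-in MI estimate (nats) between two discrete sample sequences."""
    n = len(zs)
    pz, py, pzy = Counter(zs), Counter(ys), Counter(zip(zs, ys))
    mi = 0.0
    for (z, y), c in pzy.items():
        # c/n is the empirical joint; c*n/(pz[z]*py[y]) is its ratio
        # to the product of the empirical marginals.
        mi += (c / n) * math.log(c * n / (pz[z] * py[y]))
    return mi

# Toy illustration of a "ridge": a mid-layer code that is perfectly
# predictive of y versus a final-layer code that loses information.
ys         = [0, 0, 1, 1, 0, 1, 0, 1]
mid_layer  = [0, 0, 1, 1, 0, 1, 0, 1]   # I(Z; Y) = H(Y) = log 2
last_layer = [0, 0, 1, 0, 0, 1, 0, 0]   # collapses two of the 1s

print(mutual_information(mid_layer, ys))   # ~0.693 nats
print(mutual_information(last_layer, ys))  # strictly smaller
```

In practice one would extract Z_l from each transformer layer and estimate the MI with a more robust estimator (the plug-in count-based estimate is biased for high-dimensional continuous representations), but the layer-wise comparison above captures the quantity the ridge is defined over.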