This paper investigates the internal behavior of Transformer language models. Many recent pre-trained models have been reported to exhibit only slight changes in the angular distance between the input and output hidden state vectors across the middle Transformer layers, while a disproportionately large ``jump'' in the angular distance occurs in or around the final Transformer layer. To characterize this phenomenon, we first introduce a quantitative metric for the jump strength around the final layer, then demonstrate that the jump is prevalent across many open-weight models and is amplified over the course of pre-training. Under the assumption that such jumps indicate an undesirable property, we propose the jump-suppressing regularizer (JREG), which penalizes this jump during pre-training and thereby encourages more balanced capability usage across the middle layers. Empirical evaluations of Llama-based models at three sizes, trained with the proposed JREG, show improved task performance over the baseline without any change to the model architecture.
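As a rough illustration of the quantities described above, the following is a minimal sketch of how per-layer angular distances, a final-layer jump score, and a JREG-style penalty could be computed from a model's hidden states. The function names (`final_layer_jump_strength`, `jreg_penalty`), the ratio-based jump score, and the hinge-style penalty are assumptions for illustration only; the abstract does not specify the paper's exact metric or regularizer.

```python
import torch
import torch.nn.functional as F

def angular_distance(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Angular distance (radians) between vectors x and y along the last dimension."""
    cos = F.cosine_similarity(x, y, dim=-1)
    return torch.arccos(cos.clamp(-1.0 + eps, 1.0 - eps))

def layerwise_angular_distances(hidden_states) -> torch.Tensor:
    """Angular distance between each layer's input and output hidden states.

    `hidden_states` is the tuple returned by Hugging Face models when called with
    output_hidden_states=True: the embedding output plus one tensor per layer,
    each of shape (batch, seq_len, d_model). Returns a tensor of shape (num_layers,).
    """
    dists = [
        angular_distance(hidden_states[l], hidden_states[l + 1]).mean()
        for l in range(len(hidden_states) - 1)
    ]
    return torch.stack(dists)

def final_layer_jump_strength(dists: torch.Tensor) -> torch.Tensor:
    """Hypothetical jump-strength score: the final layer's angular change relative
    to the mean change over the middle layers (the paper's metric may differ)."""
    middle = dists[1:-1].mean()
    return dists[-1] / (middle + 1e-8)

def jreg_penalty(dists: torch.Tensor, weight: float = 0.1) -> torch.Tensor:
    """Illustrative JREG-style term added to the pre-training loss, penalizing the
    excess of the final-layer angular change over the middle-layer average."""
    return weight * torch.relu(dists[-1] - dists[1:-1].mean())
```

In this sketch, a score near 1 would indicate that the final layer changes the hidden state about as much as a typical middle layer, while a much larger value would correspond to the disproportionate jump described above.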