Causal self-attention provides positional information to Transformer decoders. Prior work has shown that stacks of causal self-attention layers alone induce a positional bias in attention scores toward earlier tokens. However, this differs from the bias toward later tokens typically observed in Transformer decoders, known as recency bias. We address this discrepancy by analyzing the interaction between causal self-attention and other architectural components. We show that stacked causal self-attention layers combined with LayerNorm induce recency bias. Furthermore, we examine the effects of residual connections and the distribution of input token embeddings on this bias. Our results provide new theoretical insights into how positional information interacts with architectural components and suggest directions for improving positional encoding strategies.
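The earlier-token bias attributed to prior work above can be illustrated with a toy sketch. It rests on strong simplifying assumptions that are not taken from the paper: attention weights are uniform (each token attends equally to itself and all earlier tokens), and there are no value projections, LayerNorm, or residual connections. The function names are hypothetical. Under these assumptions, stacking layers amounts to multiplying lower-triangular row-stochastic matrices, and the last token's effective attention shifts toward the first positions as depth grows.

```python
def uniform_causal_attention(n):
    """Row-stochastic causal matrix: token i attends equally to tokens 0..i.

    Assumption for illustration only: real attention weights are learned,
    not uniform.
    """
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain square matrix product, kept dependency-free."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def effective_attention(n, num_layers):
    """Product of num_layers identical uniform causal attention matrices.

    Row i of the result says how much each input position contributes to
    token i after num_layers attention-only layers.
    """
    a = uniform_causal_attention(n)
    out = a
    for _ in range(num_layers - 1):
        out = matmul(a, out)
    return out


if __name__ == "__main__":
    n = 8
    for depth in (1, 2, 4):
        last_row = effective_attention(n, depth)[-1]
        # Weights of the final token over input positions 0..n-1.
        print(depth, [round(w, 3) for w in last_row])
```

With one layer the last token weights all positions equally; with more layers the weights decrease monotonically from the first position to the last. This toy model says nothing about the recency bias that the abstract attributes to the combination with LayerNorm.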