We study how information propagates in decoder-only Transformers, the architectural backbone of most frontier large language models (LLMs). We rely on a theoretical signal-propagation analysis: specifically, we analyse the representation of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct input sequences can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways, leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions for ameliorating these issues.
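The interaction between collapse and low precision can be illustrated with a minimal numerical sketch. This is a toy model, not the paper's actual construction: a single uniform-attention read-out is approximated as mean pooling over the sequence, and the sequence lengths and dtype choices below are illustrative assumptions. Two sequences that differ in one token produce read-outs whose gap shrinks as 1/(n+1), and the gap vanishes entirely once the representations are cast to half precision.

```python
import numpy as np

# Toy "attention read-out": uniform attention over the sequence
# reduces to mean pooling of the token values.
def readout(tokens):
    return np.mean(tokens)  # computed in float64

n = 10000
seq_a = [1.0] * n            # n copies of token "1"
seq_b = [1.0] * n + [0.0]    # same sequence plus one extra "0" token

ra, rb = readout(seq_a), readout(seq_b)
gap = ra - rb                # equals 1/(n+1); shrinks as n grows

# Cast the read-outs to half precision, mimicking low-precision inference.
ra16, rb16 = np.float16(ra), np.float16(rb)

print(gap)          # tiny but nonzero in float64
print(ra16 == rb16) # the two distinct sequences now collapse to one value
```

In float64 the two read-outs remain distinct, so in principle the model could still separate the sequences; in float16 the gap of roughly 1e-4 falls below the spacing of representable values near 1.0, and the representations become bitwise identical, which is the sense in which low precision exacerbates the collapse.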