In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.
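The core idea of letting a layer access hidden states from all earlier layers can be sketched as a learned mixture over the per-layer representations. The sketch below is a simplified illustration, not the paper's implementation: the function names (`lime_mix`, `softmax`) and the single router-weight vector are assumptions for clarity, whereas the actual LIMe mechanism learns finer-grained (e.g. per-head) retrieval weights over earlier layers' states.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: normalizes router logits to mixture weights.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lime_mix(layer_states, router_logits):
    """Blend hidden states from all earlier layers into a single tensor.

    layer_states: list of (seq_len, d_model) arrays, one per earlier layer.
    router_logits: (num_layers,) learned scores; softmax turns them into
                   convex mixture weights, so the memory footprint per token
                   stays one (seq_len, d_model) tensor.
    """
    weights = softmax(np.asarray(router_logits, dtype=np.float64))  # (L,)
    stacked = np.stack(layer_states)                                # (L, T, D)
    return np.tensordot(weights, stacked, axes=1)                   # (T, D)
```

With equal logits the router averages the layers uniformly; training would instead shape these weights so later layers can retrieve whichever earlier representations are most useful, which is what counteracts the collapse toward a single dominant representation.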