Transformers can capture long-range dependencies using self-attention, allowing tokens to attend to all others directly. However, stacking multiple attention layers leads to attention concentration. One natural way to address this issue is cross-layer attention, which makes information from earlier layers directly accessible to later layers; however, this approach is computationally expensive. To address this problem, we propose the Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the values of the first layer to all subsequent layers. Building on this method, we further propose a variant, the Transformer with single-layer value (SVFormer), in which all layers share the value embedding from the first layer, reducing the KV cache by nearly 50%. Comprehensive empirical evidence demonstrates that ResFormer mitigates the attention-concentration problem in deeper layers and enhances representations across most layers, outperforming the vanilla Transformer, DenseFormer, and NeuTRENO in both training error and downstream tasks. SVFormer trains significantly faster than the vanilla Transformer and outperforms other methods such as GQA and CLA, with its performance influenced by sequence length and cumulative learning rate.
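The value-residual mechanism described above can be sketched in a few lines. The following is a minimal single-head illustration, not the paper's exact formulation: the function names, the unweighted sum `V + V_first`, and the absence of multi-head projections and normalization are all simplifying assumptions made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Standard scaled dot-product attention.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def resformer_layer(x, Wq, Wk, Wv, v_first):
    """One attention sublayer with a residual connection from the
    first layer's values (ResFormer sketch).

    Returns (output, v_first), so the first layer's value embedding
    can be threaded through to all subsequent layers.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    if v_first is None:
        # First layer: no residual yet; its values are reused later.
        v_first = V
    # Value residual (an unweighted sum here, as an illustrative choice).
    U = V + v_first
    return attention(Q, K, U), v_first
```

In this sketch, the SVFormer variant corresponds to attending with `v_first` alone in every layer instead of computing a per-layer `V`, which is what allows the per-layer value cache to be dropped.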