Transformers capture long-range dependencies via self-attention, which allows every token to attend directly to all others. However, stacking multiple attention layers leads to attention concentration. A natural remedy is cross-layer attention, which makes information from earlier layers directly accessible to later layers, but this approach is computationally expensive. To address this problem, we propose the Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the value states of the first layer to all subsequent layers. Building on this method, we introduce a variant, the Transformer with single-layer value (SVFormer), in which all layers share the value embedding of the first layer, reducing the $KV$ cache by nearly 50\%. Comprehensive empirical evidence demonstrates that ResFormer mitigates the attention-concentration problem in deeper layers and enhances representations across most layers, outperforming the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as on downstream tasks. Further visualization results suggest that ResFormer alleviates attention sinks by avoiding value-state drains. SVFormer trains significantly faster than the vanilla Transformer and outperforms other methods such as GQA and CLA, with its performance influenced by sequence length and cumulative learning rate.
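The residual-value mechanism described above can be illustrated with a minimal NumPy sketch. Everything here is an illustrative assumption rather than the paper's implementation: single-head scaled dot-product attention, random projection weights, no normalization or MLP blocks, and the helper names (`attention`, `resformer_forward`) are hypothetical. The key line is the addition of the cached first-layer values to each later layer's values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard single-head scaled dot-product attention (illustrative).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def resformer_forward(x, layers):
    """Sketch of the residual-value idea: cache the first layer's
    value states and add them to the values of every later layer."""
    v_first = None
    for wq, wk, wv in layers:
        q, k, v = x @ wq, x @ wk, x @ wv
        if v_first is None:
            v_first = v           # first layer: cache its values
        else:
            v = v + v_first       # residual connection from first-layer values
        x = x + attention(q, k, v)  # usual residual stream update
    return x

rng = np.random.default_rng(0)
d = 8
layers = [tuple(rng.standard_normal((d, d)) * 0.1 for _ in range(3))
          for _ in range(4)]
x = rng.standard_normal((5, d))   # 5 tokens, dimension 8
out = resformer_forward(x, layers)
print(out.shape)
```

Under this reading, the SVFormer variant would go one step further and reuse `v_first` as the value states of every layer (dropping the per-layer `wv` projection entirely), which is what allows the $KV$ cache to shrink by nearly half.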