Transformers have shown dominant performance across a range of domains, including language and vision. However, their computational cost grows quadratically with sequence length, which makes them prohibitive for resource-constrained applications. To counter this, we divide the whole sequence into segments and apply a local attention mechanism to each segment. We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention. The loss caused by reducing the attention window length is compensated by aggregating information across segments with recurrent attention. SRformer leverages the inherent memory of Recurrent Accumulate-and-Fire (RAF) neurons to update the cumulative product of keys and values. The segmented attention and lightweight RAF neurons keep the proposed transformer efficient, yielding models with sequential processing capability at a lower computation and memory cost. We apply the proposed method to T5 and BART transformers and test the modified models on summarization datasets including CNN-dailymail, XSUM, ArXiv, and MediaSUM. Notably, using segmented inputs of varied sizes, the proposed model achieves $6\%$ to $22\%$ higher ROUGE1 scores than a segmented transformer and outperforms other recurrent transformer approaches. Furthermore, compared to full attention, the proposed model reduces the computational complexity of cross attention by around $40\%$.
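To make the idea of combining segmented attention with a recurrent memory concrete, the following is a minimal NumPy sketch, not the SRformer implementation. It assumes a single attention head, queries and keys of equal length, and a simple leaky running sum standing in for the RAF neuron; the function names, the `decay` parameter, and the way the local and recurrent terms are added are all illustrative assumptions. A small helper also counts query-key scores to show why restricting attention to segments lowers the quadratic cost.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def segmented_recurrent_attention(q, k, v, segment_len, decay=0.9):
    """Hypothetical sketch: local attention per segment plus a recurrent memory.

    q, k, v have shape (seq_len, d). The leaky accumulator below is an
    assumption standing in for the RAF neuron described in the abstract.
    """
    seq_len, d = q.shape
    memory = np.zeros((d, v.shape[1]))  # running summary of K^T V
    out = np.zeros((seq_len, v.shape[1]))
    for start in range(0, seq_len, segment_len):
        sl = slice(start, start + segment_len)
        qs, ks, vs = q[sl], k[sl], v[sl]
        # Local attention: softmax restricted to the current segment.
        local = softmax(qs @ ks.T / np.sqrt(d)) @ vs
        # Recurrent term: queries read the memory built from earlier segments.
        recurrent = qs @ memory / np.sqrt(d)
        out[sl] = local + recurrent
        # Leaky accumulation of this segment's key-value product.
        memory = decay * memory + ks.T @ vs
    return out


def attention_score_count(seq_len, segment_len):
    """Number of query-key scores for full vs. segmented attention."""
    full = seq_len * seq_len
    segmented = sum(
        min(segment_len, seq_len - s) ** 2
        for s in range(0, seq_len, segment_len)
    )
    return full, segmented
```

In this sketch, a sequence of 1024 tokens split into segments of 256 needs 4 × 256² score computations instead of 1024², and the memory adds only a fixed-size `d × d` state per head. These counts illustrate the scaling argument only and are unrelated to the roughly 40% reduction reported above.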