Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences

Efficient Transformers have been developed for long-sequence modeling because of their subquadratic memory and time complexity. Sparse Transformers are a popular way to improve efficiency: they restrict self-attention to locations specified by predefined sparse patterns. However, sparsity can sacrifice expressiveness relative to full attention when important token correlations are multiple hops away. To combine the efficiency of sparse Transformers with the expressiveness of full-attention Transformers, we propose \textit{Diffuser}, a new state-of-the-art efficient Transformer. Diffuser incorporates all token interactions within a single attention layer while keeping computation and memory costs low. The key idea is to expand the receptive field of sparse attention using Attention Diffusion, which, in addition to attention among neighboring tokens, computes multi-hop correlations between disconnected tokens based on all paths connecting them. Theoretically, we show that Diffuser is a universal approximator for sequence-to-sequence modeling, and we investigate its ability to approximate full attention by analyzing the graph expander property from a spectral perspective. Experimentally, we evaluate Diffuser extensively on language modeling, image modeling, and Long Range Arena (LRA). Diffuser improves on state-of-the-art benchmarks by an average of 0.94% on text classification tasks and 2.30% on LRA, with 1.67$\times$ memory savings, demonstrating superior performance in both expressiveness and efficiency.
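To make the idea of expanding a sparse receptive field through multi-hop paths concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a sliding-window sparse pattern and a geometric (personalized-PageRank-style) weighting $\alpha(1-\alpha)^k$ over $k$-hop paths; the abstract specifies neither choice, and the names `sliding_window_mask`, `sparse_attention`, `diffuse`, and the parameter `alpha` are hypothetical. Dense matrices are used only for readability.

```python
import numpy as np


def sliding_window_mask(n, w):
    """Boolean mask: token i may attend to token j iff |i - j| <= w (assumed pattern)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w


def sparse_attention(scores, mask):
    """Row-wise softmax restricted to the positions allowed by the mask."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)  # exp(-inf) = 0 for disallowed positions
    return weights / weights.sum(axis=-1, keepdims=True)


def diffuse(attn, values, hops, alpha=0.1):
    """Multi-hop diffusion by recursion (assumed weighting, not from the abstract).

    Z_0 = V,  Z_k = (1 - alpha) * A @ Z_{k-1} + alpha * V.
    Each step needs only one product with A, so A^k is never formed explicitly.
    """
    z = values
    for _ in range(hops):
        z = (1.0 - alpha) * (attn @ z) + alpha * values
    return z


def reachable(mask, hops):
    """Which token pairs are connected by some path of at most `hops` edges.

    Works because the mask includes self-loops (the diagonal is True).
    """
    return np.linalg.matrix_power(mask.astype(np.int64), hops) > 0


# Small demo with random scores and values.
rng = np.random.default_rng(0)
n, d, w, hops, alpha = 8, 4, 1, 7, 0.2
mask = sliding_window_mask(n, w)
attn = sparse_attention(rng.normal(size=(n, n)), mask)
values = rng.normal(size=(n, d))
out = diffuse(attn, values, hops, alpha)

print("pairs reachable in 1 hop: ", int(reachable(mask, 1).sum()))
print("pairs reachable in", hops, "hops:", int(reachable(mask, hops).sum()))
```

Under these assumptions, a single sparse layer with window 1 connects each token only to its immediate neighbors, while the diffused output of the same layer depends on every token that is reachable within `hops` steps. With a genuinely sparse `attn`, each hop costs time proportional to the number of nonzero attention entries rather than to $n^2$.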
