Large Language Models (LLMs) have demonstrated remarkable proficiency in handling long context windows over natural language. However, the quadratic scaling of attention computation with sequence length creates significant efficiency bottlenecks, motivating the development of I/O-aware algorithms. In this work, we systematically examine the I/O complexity of attention mechanisms, focusing on the backward pass in both small- and large-cache settings. Using the red-blue pebble game framework, we establish tight I/O complexity bounds across the full range of cache sizes. We show that FlashAttention, a current industry standard, is optimal in the large-cache regime for both the forward and backward passes. For the small-cache regime, in contrast, we propose a new algorithm that outperforms existing methods and matches the theoretical tight bound. We further extend our analysis to sparse attention, establishing fine-grained lower bounds for both forward and backward passes across all cache sizes. Together, these results complete the theoretical picture of the I/O complexity of attention mechanisms, offering guidance for the design of efficient LLM training and inference systems.