Large Language Models (LLMs) have demonstrated remarkable capabilities in processing long-context information. However, the quadratic complexity of attention computation with respect to sequence length poses significant computational challenges, motivating the development of I/O-aware algorithms. This paper presents a comprehensive analysis of the I/O complexity of attention mechanisms, focusing on backward passes and distinguishing between small- and large-cache scenarios. Using the red-blue pebble game framework, we establish tight bounds on I/O complexity across all cache sizes. We confirm that FlashAttention, the de facto standard I/O-aware algorithm, is optimal for both forward and backward passes in the large-cache scenario. For small cache sizes, we provide an algorithm that improves over existing methods and achieves the tight bounds. Additionally, we extend our analysis to sparse attention, a mainstream approach to speeding up attention, deriving fine-grained lower bounds for both forward and backward passes under both small and large caches. Our findings complete the theoretical foundation of I/O complexity for attention mechanisms, offering insights for designing efficient algorithms for LLM training and inference.