Large Language Models (LLMs) have demonstrated remarkable proficiency in handling long context windows over natural language. However, the quadratic scaling of attention computation with sequence length creates significant efficiency bottlenecks, motivating the development of I/O-aware algorithms. In this work, we systematically examine the I/O complexity of attention mechanisms, focusing on the backward pass in both small- and large-cache settings. Using the red-blue pebble game framework, we establish tight I/O complexity bounds across the full range of cache sizes. We show that FlashAttention, a current industry standard, is optimal in the large-cache regime for both the forward and backward passes. For the small-cache regime, in contrast, we propose a new algorithm that outperforms existing methods and matches the theoretical tight bound. We further extend our analysis to sparse attention, establishing fine-grained lower bounds for both forward and backward passes across all cache sizes. Together, these results complete the theoretical picture of the I/O complexity of attention mechanisms, offering guidance for the design of efficient LLM training and inference systems.