As the foundation of large language models (LLMs), the self-attention module faces the challenge of quadratic time and memory complexity with respect to sequence length. FlashAttention accelerates attention computation and reduces its memory usage by exploiting the GPU memory hierarchy. A promising research direction is to integrate FlashAttention with quantization methods. This paper introduces INT-FlashAttention, the first INT8 quantization architecture compatible with the forward workflow of FlashAttention, which significantly improves the inference speed of FlashAttention on Ampere GPUs. We implement an INT-FlashAttention prototype with fully INT8 activations and general matrix-multiplication (GEMM) kernels, making it the first attention operator with fully INT8 input. As a general token-level post-training quantization framework, INT-FlashAttention is also compatible with other data formats such as INT4. Experimental results show that INT-FlashAttention achieves 72% faster inference speed than FlashAttention with FP16 and 82% smaller quantization error than FlashAttention with FP8.
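The token-level post-training quantization mentioned above can be illustrated in miniature: each token (row) of an activation matrix keeps its own scale, so outlier tokens do not inflate the quantization error of the others. The sketch below is a minimal NumPy illustration of symmetric per-token INT8 quantization, not the paper's CUDA kernels; the function names are hypothetical.

```python
import numpy as np

def quantize_per_token(x, n_bits=8):
    """Symmetric token-level quantization: one scale per row (token).

    Hypothetical helper for illustration; INT-FlashAttention's actual
    kernels operate on GPU tiles, not whole NumPy arrays.
    """
    qmax = 2 ** (n_bits - 1) - 1                       # 127 for INT8
    # One scale per token; epsilon guards against all-zero rows.
    scales = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-8) / qmax
    q = np.clip(np.round(x / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    """Recover an FP32 approximation from INT8 values and per-token scales."""
    return q.astype(np.float32) * scales

# Toy activations: 4 tokens, 8-dimensional features.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_per_token(x)
max_err = np.abs(dequantize(q, s) - x).max()           # bounded by scale / 2 per token
```

With per-token scales, the worst-case rounding error for each token is half its own quantization step, which is what makes the scheme robust to tokens with very different magnitudes.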