Although quantization for linear layers has been widely used, its application to accelerating the attention process remains limited. To further improve the efficiency of attention computation over SageAttention while maintaining precision, we propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. First, we propose to quantize the matrices $(Q, K)$ to INT4 at a hardware-friendly thread-level granularity and the matrices $(\widetilde P, V)$ to FP8. Second, we propose a method to smooth $Q$, enhancing the accuracy of the INT4 $QK^\top$ Matmul. Third, we propose to use an FP32 Matmul buffer for $PV$ to enhance the accuracy of the FP8 $\widetilde PV$ Matmul. The operations per second (OPS) of SageAttention2 surpass FlashAttention2 and xformers by about 3x and 5x on RTX4090, respectively. Comprehensive experiments confirm that our approach incurs negligible end-to-end metric loss across diverse models, including those for language modeling, image generation, and video generation. The code is available at https://github.com/thu-ml/SageAttention.
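To illustrate why smoothing $Q$ helps the INT4 $QK^\top$ Matmul, here is a minimal NumPy sketch (not the paper's CUDA kernels): it assumes "smoothing" means subtracting the per-channel mean of $Q$ and restoring the removed component exactly via a rank-1 correction term $\mathrm{mean}(Q)\,K^\top$, so only the quantizer's input range changes. All function names and the per-tensor symmetric INT4 scheme are illustrative assumptions.

```python
import numpy as np

def quantize_int4(x):
    # Symmetric per-tensor INT4 (illustrative): integer values in [-7, 7].
    amax = np.abs(x).max()
    scale = amax / 7.0 if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def attention_scores_int4(Q, K, smooth_q=True):
    # Optionally remove the per-channel mean of Q before quantization.
    # The removed component is added back exactly after the integer
    # Matmul, so smoothing only reduces the quantizer's dynamic range.
    if smooth_q:
        q_mean = Q.mean(axis=0, keepdims=True)   # shape (1, d)
        Qs = Q - q_mean
    else:
        q_mean = np.zeros((1, Q.shape[1]))
        Qs = Q
    q4, sq = quantize_int4(Qs)
    k4, sk = quantize_int4(K)
    # INT4 Matmul simulated with integer dot products, then dequantized.
    S = (q4.astype(np.int32) @ k4.astype(np.int32).T) * (sq * sk)
    # Rank-1 correction: add mean(Q) @ K^T back to every row.
    S = S + q_mean @ K.T
    return S

rng = np.random.default_rng(0)
Q = rng.normal(size=(64, 32)) + 3.0   # a large shared offset wastes INT4 range
K = rng.normal(size=(64, 32))
exact = Q @ K.T
err_plain  = np.abs(attention_scores_int4(Q, K, smooth_q=False) - exact).mean()
err_smooth = np.abs(attention_scores_int4(Q, K, smooth_q=True)  - exact).mean()
print(err_plain, err_smooth)   # smoothing should give the smaller error
```

With the offset removed, the quantization scale shrinks and the mean absolute error of the dequantized scores drops, which is the intuition behind the smoothing step (the real kernels quantize at thread-level granularity rather than per tensor).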