The dot-product attention mechanism plays a crucial role in modern deep architectures (e.g., Transformer) for sequence modeling. However, na\"ive exact computation of attention incurs quadratic time and memory complexity in the sequence length, hindering the training of long-sequence models. The critical bottlenecks are the computation of the partition functions in the denominator of the softmax function and the multiplication of the softmax matrix with the matrix of values. Our key observation is that the former can be reduced to a variant of the kernel density estimation (KDE) problem, and that an efficient KDE solver can further be used to accelerate the latter via subsampling-based fast matrix products. Our proposed KDEformer approximates attention in sub-quadratic time with provable spectral norm bounds, whereas all prior results provide only entry-wise error bounds. Empirically, we verify that KDEformer outperforms other attention approximations in terms of accuracy, memory, and runtime on various pre-trained models. On BigGAN image generation, we achieve better generative scores than the exact computation with over $4\times$ speedup. For ImageNet classification with T2T-ViT, KDEformer shows over $18\times$ speedup while the accuracy drop is less than $0.5\%$.
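To make the two bottlenecks concrete, the following minimal sketch computes exact attention as $D^{-1}\exp(QK^\top)V$, where the diagonal of $D$ holds the per-query partition functions, and then estimates both the partition functions and the product with the value matrix from a subsample of keys. Everything here is an illustrative assumption rather than the KDEformer algorithm: the function names (`dot`, `partition_functions`, `exact_attention`, `sampled_attention`) are hypothetical, the usual $1/\sqrt{d}$ scaling is assumed to be folded into the queries, and the sampling is plain uniform sampling with replacement, standing in for the KDE-guided sampling described in the abstract. Uniform sampling carries none of the spectral norm guarantees claimed for KDEformer.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def partition_functions(Q, K):
    """Row sums of exp(Q K^T): one softmax denominator per query."""
    return [sum(math.exp(dot(q, k)) for k in K) for q in Q]


def exact_attention(Q, K, V):
    """Naive attention D^{-1} exp(Q K^T) V, quadratic in sequence length."""
    d_v = len(V[0])
    out = []
    for q in Q:
        weights = [math.exp(dot(q, k)) for k in K]  # one row of exp(Q K^T)
        z = sum(weights)                            # partition function of this query
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / z
                    for c in range(d_v)])
    return out


def sampled_attention(Q, K, V, m, seed=0):
    """Estimate attention from m keys drawn uniformly with replacement.

    Assumption: uniform sampling is a simple stand-in, not the
    KDE-based sampling that the abstract refers to.
    """
    rng = random.Random(seed)
    n, d_v = len(K), len(V[0])
    idx = [rng.randrange(n) for _ in range(m)]
    scale = n / m  # makes each sampled sum an unbiased estimate of the full sum
    out = []
    for q in Q:
        weights = [math.exp(dot(q, K[j])) for j in idx]
        z_hat = scale * sum(weights)  # estimated partition function
        numer = [scale * sum(w * V[j][c] for w, j in zip(weights, idx))
                 for c in range(d_v)]  # estimated row of exp(Q K^T) V
        out.append([x / z_hat for x in numer])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    n, d = 256, 4
    Q = [[rng.gauss(0, 0.3) for _ in range(d)] for _ in range(n)]
    K = [[rng.gauss(0, 0.3) for _ in range(d)] for _ in range(n)]
    V = [[rng.gauss(0, 1.0) for _ in range(d)] for _ in range(n)]
    exact = exact_attention(Q, K, V)
    approx = sampled_attention(Q, K, V, m=64)
    err = max(abs(a - b) for ra, rb in zip(exact, approx) for a, b in zip(ra, rb))
    print("max entry-wise error with 64 of 256 keys:", err)
```

The sampled version touches only `m` keys per query, so its cost scales with `n * m` instead of `n * n`. How well a small `m` works depends on how evenly the attention weights are spread, which is the gap that a non-uniform, KDE-informed sampling distribution is meant to close.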