Many efficient $\textit{approximate}$ self-attention techniques have emerged since the inception of the transformer architecture. Two popular classes of these techniques are low-rank and kernel methods, each with its own strengths. We observe that these strengths complement each other and exploit them to fuse low-rank and kernel methods, producing a new class of transformers: FLuRKA ($\textbf{F}$ast $\textbf{L}$ow-$\textbf{R}$ank & $\textbf{K}$ernel $\textbf{A}$ttention). FLuRKA are highly $\textit{training-efficient}$, with faster model speeds $\textit{and}$ similar model quality compared to their constituent low-rank and kernel methods. We evaluate the speed and quality of FLuRKA both theoretically and empirically. Our model speed analysis identifies a variety of parameter configurations in which FLuRKA exhibit speedups over low-rank and kernel approximations, and our model quality analysis bounds the error of FLuRKA with respect to full attention. Empirically, we instantiate three FLuRKA variants, which achieve speedups of up to 3.3x and 1.7x over low-rank and kernel methods, respectively. This translates to speedups of up to 20x over models with FlashAttention. Across a diverse set of tasks spanning language modeling, language understanding, long sequence modeling, machine translation, and image classification, FLuRKA achieve accuracy comparable to the underlying low-rank and kernel approximations, occasionally surpassing both.
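The abstract does not spell out how the two approximations are fused, so the following is only a minimal sketch of one plausible composition, not the authors' implementation. It assumes a Linformer-style down-projection of keys and values along the sequence axis, followed by a positive kernel feature map in place of the softmax. The names `fused_lowrank_kernel_attention`, `elu_plus_one`, `E`, and `F` are hypothetical and chosen for illustration.

```python
import numpy as np


def elu_plus_one(x):
    """Positive feature map phi(x) = elu(x) + 1 (an assumed choice of kernel)."""
    # For x > 0 this is x + 1; for x <= 0 it is exp(x). Clamping the exp
    # argument avoids overflow warnings in the branch np.where discards.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def fused_lowrank_kernel_attention(Q, K, V, E, F, feature_map=elu_plus_one):
    """Single-head attention combining a low-rank projection with a kernel.

    Q, K, V: arrays of shape (n, d), with n the sequence length.
    E, F:    hypothetical projection matrices of shape (k, n), with k << n,
             that compress keys and values along the sequence axis.
    """
    K_low = E @ K                       # (k, d): low-rank step on keys
    V_low = F @ V                       # (k, d): low-rank step on values
    Q_feat = feature_map(Q)             # (n, d): kernel step on queries
    K_feat = feature_map(K_low)         # (k, d): kernel step on projected keys
    # Associativity lets us form a small (d, d) summary first, so the
    # (n, k) attention matrix is never materialized.
    KV = K_feat.T @ V_low               # (d, d)
    norm = Q_feat @ K_feat.sum(axis=0)  # (n,): per-query normalizer
    return (Q_feat @ KV) / norm[:, None]
```

Under these assumptions the output has the same shape as `Q`, and each output row is a normalized, positively weighted combination of the `k` projected value rows. A call might look like `fused_lowrank_kernel_attention(Q, K, V, E, F)` with `E` and `F` drawn at random or learned.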