Many efficient approximate self-attention techniques have emerged since the inception of the transformer architecture. Two popular classes of these techniques are low-rank methods and kernel methods, each with its own strengths. We observe that these strengths complement each other and exploit this to fuse low-rank and kernel methods, producing a new class of transformers: FLuRKA (Fast Low-Rank and Kernel Attention). FLuRKA deliver sizable runtime gains over both underlying approximate techniques while maintaining model quality. We evaluate both the runtime performance and the quality of FLuRKA, theoretically and empirically. Our runtime analysis identifies a range of parameter configurations in which FLuRKA exhibit speedups, and our accuracy analysis bounds the error of FLuRKA with respect to full attention. We instantiate three FLuRKA variants, which achieve empirical speedups of up to 3.3x over low-rank methods and up to 1.7x over kernel methods. This translates to speedups of up to 30x over models with full attention. With respect to model quality, FLuRKA can match the accuracy of low-rank and kernel methods on GLUE after pre-training on WikiText-103. When pre-training on a fixed time budget, FLuRKA yield better perplexity scores than models with full attention.
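To make the idea of fusing the two families concrete, the following is a minimal sketch of one plausible composition, not the authors' implementation. The abstract does not specify how the methods are combined, so the sketch assumes that keys and values are first down-projected along the sequence dimension (the low-rank step) and that a positive feature map is then applied to the queries and the projected keys (the kernel step). The names `fused_attention`, `elu_plus_one`, `E`, and `F`, as well as the choice of `elu(x) + 1` as the feature map, are illustrative assumptions.

```python
import numpy as np

def elu_plus_one(x):
    # Positive feature map elu(x) + 1 (an assumed choice, common in linear attention).
    # For x > 0 this is x + 1; otherwise exp(x). The minimum guards against overflow.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def fused_attention(Q, K, V, E, F):
    """Hypothetical low-rank + kernel attention.

    Q, K, V: arrays of shape (n, d), where n is the sequence length.
    E, F:    projection matrices of shape (k, n) with k < n (assumed low-rank step).
    Returns an array of shape (n, d).
    """
    K_low = E @ K                      # (k, d): keys compressed along the sequence
    V_low = F @ V                      # (k, d): values compressed along the sequence
    phi_Q = elu_plus_one(Q)            # (n, d): kernel features of the queries
    phi_K = elu_plus_one(K_low)        # (k, d): kernel features of the compressed keys
    KV = phi_K.T @ V_low               # (d, d): summary computed once for all queries
    Z = phi_Q @ phi_K.sum(axis=0)      # (n,):  per-query normalizer, strictly positive
    return (phi_Q @ KV) / Z[:, None]   # (n, d): no n-by-n matrix is ever formed

# Small usage example with random inputs.
rng = np.random.default_rng(0)
n, d, k = 64, 8, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(n)
F = rng.standard_normal((k, n)) / np.sqrt(n)
out = fused_attention(Q, K, V, E, F)
print(out.shape)
```

In this sketch the kernel step operates on only `k` compressed keys rather than `n`, and the low-rank step avoids a softmax over the compressed sequence, which is one way the two approaches could complement each other under the stated assumptions.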