Scaled Dot Product Attention (SDPA) is the backbone of many modern deep-learning models. It is so versatile that it has been used in natural language, vision, and multi-modal domains with very little change from its original formulation. This paper examines why the current formulation is inefficient by delving into the mathematical details of the attention mechanism. We propose three improvements to mitigate these inefficiencies, thereby introducing three enhanced attention mechanisms: Optimised, Efficient, and Super Attention. Optimised and Efficient Attention have one and two matrix multiplications fewer per head and 25% and 50% fewer parameters, respectively, than standard SDPA, yet perform similarly to standard SDPA in both vision and natural language tasks. They can be used in all applications where SDPA is used, offering smaller model sizes and faster training and inference without noticeable loss in performance. Super Attention introduces a new linear transformation on the values, transforming them from the left. It outperforms standard SDPA on vision and natural language tasks by up to 17% while having one fewer matrix multiplication per head and 25% fewer parameters than standard SDPA; consequently, it is also faster than standard SDPA. Super Attention is ideal in applications where the attention layer's context length is fixed, such as Vision Transformers. In addition to providing mathematical reasoning, we evaluate the presented attention mechanisms on several datasets, including MNIST, CIFAR100, ImageNet, IMDB Movie Reviews, and Amazon Reviews, as well as the combined Europarl and Anki English-Spanish datasets for neural machine translation.
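The key idea distinguishing Super Attention from standard SDPA can be illustrated with a minimal single-head sketch. This is an assumption-laden illustration based only on the abstract's description, not the paper's exact formulation: it assumes the left transformation is a learned matrix `W_a` of shape (context length × context length) applied to the values along the sequence dimension, which is why the context length must be fixed. The function names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sdpa(X, W_q, W_k, W_v):
    # Standard scaled dot-product attention for a single head.
    # X: (seq_len, d); W_q, W_k, W_v: (d, d).
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def super_attention_sketch(X, W_q, W_k, W_a):
    # Hypothetical sketch of the left value transformation described in
    # the abstract: W_a has shape (seq_len, seq_len) and multiplies the
    # values from the LEFT, mixing token positions rather than features.
    # Because W_a's shape is tied to seq_len, the context length is fixed.
    Q, K = X @ W_q, X @ W_k
    d = Q.shape[-1]
    V = W_a @ X  # left transformation along the sequence dimension
    return softmax(Q @ K.T / np.sqrt(d)) @ V
```

Note that a per-feature value projection (`X @ W_v`) acts identically at every position, whereas a left multiplication (`W_a @ X`) learns position-dependent mixing, which is compatible with fixed-length inputs such as image patches in a Vision Transformer.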