Self-attention is key to the remarkable success of transformers in sequence modeling tasks, including many applications in natural language processing and computer vision. Like other neural network layers, attention mechanisms are often developed through heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that self-attention corresponds to the support vector expansion derived from a support vector regression (SVR) problem whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attention mechanisms: 1) Batch Normalized Attention (Attention-BN), derived from the batch normalization layer, and 2) Attention with Scaled Head (Attention-SH), derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications, including image and time-series classification.