This paper challenges the conventional belief that softmax attention in transformers is effective primarily because it generates a probability distribution for attention allocation. Instead, we show theoretically that its success lies in its ability to implicitly regularize the Frobenius norm of the attention matrix during training. We then explore alternative activations that regularize the Frobenius norm of the attention matrix, demonstrating that certain polynomial activations can achieve this effect, making them suitable for attention-based architectures. Empirical results indicate these activations perform comparably to, or better than, softmax across various computer vision and language tasks, suggesting new possibilities for attention mechanisms beyond softmax.
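To make the idea concrete, here is a minimal sketch of scaled dot-product attention with a polynomial activation substituted for softmax. The cubic activation, the 1/n scaling, and the name `poly_attention` are illustrative assumptions, not the paper's prescribed recipe; the only grounded fact used is that softmax rows sum to 1, which bounds the attention matrix's Frobenius norm by sqrt(n).

```python
import torch

def poly_attention(q, k, v, p=3):
    """Scaled dot-product attention with a polynomial activation in place
    of softmax. Illustrative sketch: the exact polynomial degree and
    scaling here are assumptions, not the paper's stated choice.

    q, k, v: (batch, seq_len, d) tensors.
    """
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5  # (batch, n, n) logits
    # Elementwise polynomial activation. The 1/n factor is a hypothetical
    # scaling chosen to keep ||A||_F on the same order as softmax's bound:
    # each softmax row is a probability vector, so ||A||_F <= sqrt(n).
    attn = scores.pow(p) / n
    return attn @ v

# Usage: drop-in replacement for a softmax attention call.
q = k = v = torch.randn(2, 16, 64)
out = poly_attention(q, k, v)  # (2, 16, 64)
```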