This paper questions whether the strong performance of softmax attention in transformers stems from producing a probability distribution over inputs. Instead, we argue that softmax's effectiveness lies in its implicit regularization of the Frobenius norm of the attention matrix, which stabilizes training. Motivated by this, we explore alternative activations, specifically polynomials, that achieve a similar regularization effect. Our theoretical analysis shows that certain polynomials can serve as effective substitutes for softmax, achieving strong performance across transformer applications despite violating softmax's typical properties of positivity, normalization, and sparsity. Extensive experiments support these findings, offering a new perspective on attention mechanisms.
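To make the proposed substitution concrete, below is a minimal sketch contrasting standard softmax attention with a polynomial-activation variant. The cubic activation and the 1/n scaling used here are illustrative assumptions chosen to keep the attention matrix's Frobenius norm controlled; the abstract does not specify the paper's exact polynomial or normalization.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard softmax attention: scores are normalized into a
    # probability distribution over input positions.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def polynomial_attention(Q, K, V, degree=3):
    # Hypothetical polynomial replacement: the scaled scores pass through
    # an elementwise power instead of softmax. The 1/n factor is one
    # illustrative way to bound the Frobenius norm of the attention
    # matrix; it is an assumption, not the paper's stated recipe.
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = (scores ** degree) / n
    return weights @ V

# Toy usage: same inputs through both attention variants.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
print(softmax_attention(Q, K, V).shape)     # (4, 8)
print(polynomial_attention(Q, K, V).shape)  # (4, 8)
```

Note that the polynomial weights need not be positive, sum to one, or be sparse, which is exactly the departure from softmax's usual properties that the abstract highlights.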