The softmax function is central to Transformer attention: it normalizes each row of the attention scores so that the entries sum to one, and it has consistently outperformed alternative normalization functions. However, softmax can suffer from vanishing gradients when some attention scores approach extreme values, i.e., probabilities close to one or zero. In this paper, we propose Self-Adjust Softmax (SA-Softmax) to address this issue by replacing $\mathrm{softmax}(x)$ with $x \cdot \mathrm{softmax}(x)$ or its normalized variant $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \mathrm{softmax}(x)$. We show theoretically that SA-Softmax has better gradient properties than the vanilla softmax function. Moreover, SA-Softmax can be integrated into the attention mechanisms of existing Transformer models with only minor adjustments. We conducted experiments comparing the empirical performance of Transformer models using SA-Softmax against the vanilla softmax function. These experiments, involving models with up to 2.7 billion parameters, cover diverse datasets, language tasks, and positional encoding methods.
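To make the two formulas above concrete, here is a minimal NumPy sketch of the plain and normalized SA-Softmax variants. The function names and the small epsilon guard against a zero denominator are our own illustrative choices, not part of the paper's specification.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def sa_softmax(x, axis=-1):
    # Plain SA-Softmax: x * softmax(x).
    return x * softmax(x, axis=axis)

def sa_softmax_normalized(x, axis=-1):
    # Normalized variant: rescale x by
    # (x - min(x_min, 0)) / (max(0, x_max) - min(x_min, 0))
    # before multiplying by softmax(x).
    x_min = np.minimum(np.min(x, axis=axis, keepdims=True), 0.0)
    x_max = np.maximum(np.max(x, axis=axis, keepdims=True), 0.0)
    # Epsilon guard (an assumption, for the all-zero edge case only).
    denom = np.maximum(x_max - x_min, 1e-12)
    return (x - x_min) / denom * softmax(x, axis=axis)
```

Note that unlike softmax, the plain variant's outputs no longer sum to one per row; the normalized variant rescales the multiplier into $[0, 1]$ so the scale of the scores does not blow up the output.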