We introduce softpick, a rectified, non-sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M- and 1.8B-parameter models demonstrate that softpick consistently achieves a 0\% sink rate. Softpick transformers produce hidden states with significantly lower kurtosis and create sparse attention maps. Quantized softpick models outperform their softmax counterparts on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion show how softpick may open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Code: https://github.com/zaydzuhri/softpick-attention.
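The abstract does not spell out the formula, but a minimal sketch of a rectified, non-sum-to-one softmax variant consistent with its description can be written as follows. This is an illustrative assumption, not necessarily the paper's exact definition: the numerator rectifies $e^{x}-1$ (so negative scores map to exactly zero, giving sparsity), and the denominator sums absolute values, so the output need not sum to one (so a row of all-negative scores can produce near-zero weights everywhere, rather than being forced onto a sink token).

```python
import numpy as np

def softpick(x, eps=1e-8):
    """Illustrative sketch of a rectified, non-sum-to-one softmax variant.

    Assumed form: relu(e^x - 1) / (sum |e^x - 1| + eps), computed in a
    numerically stable way by factoring out e^{-max(x)} (which cancels
    between numerator and denominator).
    """
    m = np.max(x)
    e = np.exp(x - m) - np.exp(-m)      # equals (e^x - 1) * e^{-m}
    num = np.maximum(e, 0.0)            # ReLU: negative scores -> exactly 0
    den = np.abs(e).sum() + eps         # need not equal the numerator's sum
    return num / den

scores = np.array([1.0, -1.0, 0.0])
weights = softpick(scores)
# Negative and zero logits contribute nothing, and the weights sum to < 1.
```

Unlike softmax, a row of uniformly negative scores yields weights near zero for every position, which is the mechanism by which such a function can avoid an attention sink.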