Linear attention transformers have become a strong alternative to softmax attention due to their efficiency. However, linear attention tends to be less expressive, resulting in reduced accuracy compared to softmax attention. To bridge this accuracy gap, we modify Mamba-2, a very strong linear attention variant. We first simplify Mamba-2 down to its most fundamental and important components, evaluating which specific design choices make it most accurate. From this simplified Mamba variant (Mamba-2S), we improve the A-mask and increase the order of the hidden state, resulting in a method, which we call 2Mamba, that is nearly as accurate as softmax attention yet far more memory efficient at long context lengths. We also investigate elements of Mamba-2 that help it surpass softmax attention in accuracy. Code is provided for all of our experiments.