The quadratic computational complexity of softmax transformers has become a bottleneck in long-context scenarios. In contrast, linear attention model families provide a promising direction towards more efficient sequence models. These linear attention models compress past key-value (KV) pairs into a single fixed-size hidden state, thereby reducing complexity during both training and inference. However, their expressivity remains limited by the size of that hidden state. Previous work proposed interleaving softmax and linear attention layers to reduce computational complexity while preserving expressivity. Nevertheless, the efficiency of these models remains bottlenecked by their softmax attention layers. In this paper, we propose Neural Attention Search Linear (NAtS-L), a framework that applies both linear attention and softmax attention operations within the same layer, on different tokens. NAtS-L automatically determines whether a token can be handled by a linear attention model, i.e., a token with only short-term impact that can be encoded into the fixed-size hidden state, or requires softmax attention, i.e., a token that carries information needed for long-term retrieval and must be preserved for future queries. By searching for optimal combinations of Gated DeltaNet and softmax attention across tokens, we show that NAtS-L provides a strong yet efficient token-level hybrid architecture.
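To make the token-level routing concrete, the following is a minimal toy sketch of the idea described above: a per-token mask decides whether each token's KV pair is folded into a fixed-size linear-attention state or appended to a growing softmax KV cache, and each query reads from both memories. This is a hypothetical illustration only; it uses plain (ungated) linear attention rather than Gated DeltaNet, and the mask `is_long_term` stands in for the learned search mechanism, which is not shown.

```python
import numpy as np

def hybrid_attention(q, k, v, is_long_term):
    """Toy token-level hybrid of linear and softmax attention.

    Hypothetical sketch of the routing idea: 'short-term' tokens
    update a fixed-size linear-attention state S, while 'long-term'
    tokens are kept in a softmax KV cache and attended to exactly.
    q, k, v: arrays of shape (T, d); is_long_term: length-T booleans.
    """
    T, d = q.shape
    S = np.zeros((d, d))          # fixed-size state for linear tokens
    cache_k, cache_v = [], []     # growing cache for softmax tokens
    outputs = []
    for t in range(T):
        # read from both memory paths (causal: only past tokens so far)
        lin_out = q[t] @ S
        if cache_k:
            K, V = np.stack(cache_k), np.stack(cache_v)
            scores = K @ q[t] / np.sqrt(d)
            w = np.exp(scores - scores.max())
            w /= w.sum()
            soft_out = w @ V
        else:
            soft_out = np.zeros(d)
        outputs.append(lin_out + soft_out)
        # route token t's KV pair into exactly one memory
        if is_long_term[t]:
            cache_k.append(k[t])
            cache_v.append(v[t])
        else:
            S += np.outer(k[t], v[t])  # compress into the hidden state
    return np.stack(outputs)
```

Because only the tokens flagged as long-term enter the softmax cache, the quadratic cost is paid only on that subset, while the remaining tokens incur the constant-size cost of the linear-attention state.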