Efficient long-context language modeling remains a significant challenge in Natural Language Processing (NLP). While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). These SALs identify tokens requiring long-range interactions, remove less important features, and then augment their representations using the attention module. This approach balances Mamba's efficiency with Transformer-like performance in memory-intensive tasks. By constraining the attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens while preserving computational efficiency. Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
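The Selective Attention Layer described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration only — the scoring function `w_score`, the top-k selection, the residual update, and the omission of causal masking and multi-head structure are simplifications, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def selective_attention(h, Wq, Wk, Wv, w_score, budget):
    """Augment only the top-`budget` tokens (ranked by a learned
    importance score) with attention over the full sequence; all
    other tokens pass through unchanged. Hypothetical sketch."""
    T, d = h.shape
    scores = h @ w_score                  # (T,) importance per token
    idx = np.argsort(scores)[-budget:]    # indices of selected tokens
    q = h[idx] @ Wq                       # queries only for selected tokens
    k, v = h @ Wk, h @ Wv                 # keys/values for the whole sequence
    attn = softmax(q @ k.T / np.sqrt(d))  # (budget, T) attention weights
    out = h.copy()
    out[idx] = h[idx] + attn @ v          # residual augmentation of selected tokens
    return out, idx

# Toy usage: 16 tokens, hidden size 8, attention budget of 4 tokens.
T, d = 16, 8
h = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
w_score = rng.standard_normal(d)
out, idx = selective_attention(h, Wq, Wk, Wv, w_score, budget=4)
print(idx.shape, out.shape)
```

Because only `budget` tokens form queries, the attention cost per layer scales with the budget rather than with the full sequence length, which is the efficiency trade-off the abstract refers to.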