Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Mamba, a recent state space model (SSM) with linear complexity, offers promising efficiency gains but suffers from unstable in-context learning and limited multitask generalization. This paper proposes TransMamba, a novel framework that unifies Transformer and Mamba through shared parameter matrices (e.g., QKV and CBx) and can therefore dynamically switch between attention and SSM mechanisms at different token lengths and layers. We design a Memory converter that bridges Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at the TransPoints where the transformation happens. We also thoroughly explore TransPoint scheduling for further improvements. Extensive experiments demonstrate that TransMamba achieves superior training efficiency and performance compared to baselines, and validate a deeper consistency between the Transformer and Mamba paradigms, offering a scalable solution for next-generation sequence modeling.
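The handover idea can be illustrated with a minimal sketch. Assuming the shared matrices let the attention prefix be summarized as a cumulative state of key-value outer products (the form that linear attention and SSM recurrences share), an SSM can continue from that state after the switch point; the weight names, dimensions, and the omission of any decay term below are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T_attn, T_ssm = 4, 3, 2  # hidden dim, tokens before/after the TransPoint

# Shared projection matrices (hypothetical): the same weights produce Q/K/V
# for the attention phase and, after the TransPoint, C/B/x for the SSM phase.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

h = rng.standard_normal((T_attn + T_ssm, d))  # token hidden states
Q, K, V = h @ Wq, h @ Wk, h @ Wv

# Memory-converter sketch: summarize the attention prefix as a recurrent
# state S = sum_t outer(k_t, v_t), which an SSM-style recurrence can extend.
S = sum(np.outer(K[t], V[t]) for t in range(T_attn))

# After the TransPoint, the recurrence continues from S, reinterpreting
# B_t ~ k_t, x_t ~ v_t, C_t ~ q_t (state decay omitted for simplicity).
for t in range(T_attn, T_attn + T_ssm):
    S = S + np.outer(K[t], V[t])
    y_t = Q[t] @ S  # SSM-style readout for token t

# In this simplified no-decay setting the handover is lossless: the final
# output matches unnormalized linear attention over the whole sequence.
ref = Q[-1] @ sum(np.outer(K[t], V[t]) for t in range(T_attn + T_ssm))
assert np.allclose(y_t, ref)
```

The point of the sketch is only that a prefix processed with attention-style projections can be compressed into a single fixed-size state from which a linear recurrence continues, which is the property a seamless TransPoint switch relies on.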