MiniMax Sparse Attention

Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen, Yang Xu, Lunbin Zeng, Xiaolong Li, Haohai Sun, Haichao Zhu, Vito Zhang, Jinkai Hu, Jiayao Li, Rui Gao, Zekun Li, Songquan Zhu, Jingkai Zhou, Pengyu Zhao

arXiv preprint, 30 pages, 14 figures

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to attend jointly over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism built on Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. MSA is deliberately streamlined around a principle of simplicity and scalability, which makes it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.
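The two-branch flow described in the abstract can be illustrated with a minimal pure-Python sketch of a single decode step for one GQA group. It is an illustration under stated assumptions, not the released kernel: the abstract does not say how the Index Branch pools token scores into a block score, so max-pooling of raw dot products is assumed here, and all function names (block_scores, top_k_blocks, sparse_attention) are hypothetical. The selection step ranks raw logits without computing a softmax, which is safe because softmax is monotonic and leaves the ranking unchanged; this is one plausible reading of "exp-free Top-k selection".

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_scores(index_q, index_keys, block_size):
    """Index Branch (assumed form): score each KV block by the max raw
    logit between the index query and the index keys inside the block."""
    return [max(dot(index_q, k) for k in index_keys[s:s + block_size])
            for s in range(0, len(index_keys), block_size)]

def top_k_blocks(scores, k):
    """Rank raw logits directly. No exp/softmax is needed because softmax
    is monotonic, so the Top-k set is the same either way."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def sparse_attention(group_queries, keys, values, selected, block_size):
    """Main Branch: exact softmax attention restricted to the selected
    blocks. All query heads of the GQA group share one selection and one
    KV head."""
    token_ids = [t for b in selected
                 for t in range(b * block_size,
                                min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in group_queries:
        logits = [dot(q, keys[t]) * scale for t in token_ids]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - m) for l in logits]
        total = sum(weights)
        outputs.append([sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
                        for d in range(len(values[0]))])
    return outputs

# Toy decode step: 8 cached tokens, blocks of 2 tokens, keep the 2 best blocks.
# Reusing the keys as index keys is a simplification made for this example.
keys = [[1.0, 0.0], [3.0, 0.0], [-2.0, 0.0], [0.5, 0.0],
        [0.0, 1.0], [0.0, 2.0], [4.0, 0.0], [1.0, 1.0]]
values = [[float(t), 1.0] for t in range(8)]
group_queries = [[1.0, 0.0], [0.5, 0.5]]  # two query heads in one GQA group

selected = top_k_blocks(block_scores([1.0, 0.0], keys, 2), k=2)
outputs = sparse_attention(group_queries, keys, values, selected, 2)
print(selected)  # [0, 3]
```

In a full layer this routine would run once per GQA group, so each group retrieves its own blocks, while the per-token cost depends on k times the block size rather than on the full context length.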
