MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

MiniCPM Team, Wenhao An, Yingfa Chen, Yewei Fang, Jiayi Li, Xin Li, Yaohui Li, Yishan Li, Yuxuan Li, Biyuan Lin, Chuan Liu, Hezi Liu, Siyuan Liu, Hongya Lyu, Yinxu Pan, Shixin Ren, Xingyu Shen, Zhou Su, Haojun Sun, Yangang Sun, Zhen Leng Thai, Xin Tian, Rui Wang, Xiaorong Wang, Yudong Wang, Bo Wu, Xiaoyue Xu, Dong Xu, Shuaikang Xue, Jiawei Yang, Bowen Zhang, Jinqian Zhang, Letian Zhang, Shengnan Zhang, Xinyu Zhang, Xinyuan Zhang, Zhu Zhang, Hengyu Zhao, Jiacheng Zhao, Jie Zhou, Zihan Zhou, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun

From arXiv: MiniCPM-SALA Technical Report

The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, which reduces training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at a sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale where traditional full-attention 8B models fail because of memory constraints.
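
To make the memory argument concrete, the following back-of-the-envelope sketch estimates key-value (KV) cache size at 1M tokens. Every number in `ModelConfig` is a hypothetical placeholder, not a value from the report. The sketch also assumes that only the sparse-attention layers keep a per-token KV cache, that the linear-attention layers hold a fixed-size state small enough to ignore, and that the 1:3 ratio means one sparse layer for every three linear layers. Model weights and activations are not counted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    # Hypothetical values chosen for illustration only; not from the report.
    num_layers: int = 32
    num_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # 16-bit cache entries (fp16/bf16)


def kv_cache_bytes(cfg: ModelConfig, seq_len: int, kv_layer_fraction: float = 1.0) -> int:
    """Bytes needed to cache keys and values for seq_len tokens.

    kv_layer_fraction is the share of layers that keep a per-token KV cache.
    """
    kv_layers = round(cfg.num_layers * kv_layer_fraction)
    # Factor 2: one key vector and one value vector per KV head.
    per_token_per_layer = 2 * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_value
    return kv_layers * per_token_per_layer * seq_len


def gib(num_bytes: int) -> float:
    return num_bytes / 2**30


cfg = ModelConfig()
seq_len = 2**20  # "1M" tokens, read here as 1,048,576

full = kv_cache_bytes(cfg, seq_len)  # every layer caches K and V
hybrid = kv_cache_bytes(cfg, seq_len, kv_layer_fraction=0.25)  # assumed 1:3 split

print(f"full attention KV cache: {gib(full):.0f} GiB")
print(f"hybrid (1:3) KV cache:   {gib(hybrid):.0f} GiB")
```

Under these assumptions the full-attention cache grows linearly with both layer count and sequence length, so moving three quarters of the layers to a constant-size state cuts the cache to a quarter of its size. The actual figures for MiniCPM-SALA depend on its real configuration and cache precision.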
