Kwai Summary Attention Technical Report

Chenglong Chu, Guorui Zhou, Guowang Zhang, Han Li, Hao Peng, Hongtao Cheng, Jian Liang, Jiangxia Cao, Kun Gai, Lingzhi Zhou, Lu Ren, Qi Zhang, Ruiming Tang, Ruitao Wang, Xinchen Luo, Yi Su, Zhiyuan Liang, Ziqi Wang, Boyang Ding, Chengru Song, Dunju Zang, Hui Wang, Jiao Ou, Jiaxin Deng, Jijun Shi, Jinghao Zhang, Junmin Chen, Lejian Ren, Minxuan Lv, Qianqian Wang, Qigen Hu, Shiyao Wang, Siyang Mao, Tao Wang, Xingmei Wang, Zhixin Ling, Ziming Li, Zixing Zhang

From arXiv. Work in progress.

Long-context ability has become one of the most important directions for next-generation Large Language Models, particularly in semantic understanding and reasoning, agentic code intelligence, and recommendation systems. However, standard softmax attention has quadratic time complexity with respect to sequence length, so training and inference costs grow rapidly for extremely long sequences. Existing solutions mitigate this issue along two technical routes: i) reducing the KV cache per layer, for example through head-level compression (GQA) or embedding-dimension-level compression (MLA), although the KV cache still grows linearly with sequence length at a 1:1 ratio; ii) interleaving with KV-cache-friendly architectures, such as local attention (SWA) or linear kernels (GDN), which often involves a trade-off between KV cache size and long-context modeling effectiveness. Beyond these two routes, we argue that there is an intermediate path that has not been well explored: maintaining a linear relationship between the KV cache and sequence length, but performing semantic-level compression at a specific ratio $k$. This $O(n/k)$ path does not pursue a "minimum KV cache"; rather, it trades an acceptable memory cost for complete, referential, and interpretable retention of long-distance dependencies. Motivated by this, we propose Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical contexts into learnable summary tokens.
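To make the $O(n)$ versus $O(n/k)$ cache scaling concrete, the following is a minimal back-of-the-envelope sketch, not the KSA implementation. It assumes one cached summary entry per $k$ historical tokens and ignores any uncompressed recent tokens a real system might also keep. The function names and the model dimensions (32 layers, 8 KV heads, head dimension 128, 2 bytes per element) are hypothetical values chosen only for illustration.

```python
import math


def kv_entries_full(n: int) -> int:
    """Cached positions under standard attention: one K/V entry per token."""
    return n


def kv_entries_summary(n: int, k: int) -> int:
    """Cached positions under an assumed n/k compression:
    one summary entry per k tokens (illustrative assumption)."""
    return math.ceil(n / k)


def kv_bytes(entries: int, num_layers: int, num_kv_heads: int,
             head_dim: int, bytes_per_elem: int = 2) -> int:
    """Total cache size in bytes; the factor 2 accounts for keys and values."""
    return 2 * entries * num_layers * num_kv_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    n, k = 131072, 8  # hypothetical sequence length and compression ratio
    dims = dict(num_layers=32, num_kv_heads=8, head_dim=128)  # hypothetical model
    full = kv_bytes(kv_entries_full(n), **dims)
    compressed = kv_bytes(kv_entries_summary(n, k), **dims)
    print(f"full cache:       {full / 2**30:.1f} GiB")
    print(f"compressed cache: {compressed / 2**30:.1f} GiB")
```

Under these assumed dimensions, the cache still grows linearly with $n$ in both cases, but the slope drops by a factor of $k$: 16 GiB for the full cache against 2 GiB at $k = 8$.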
