Pretraining transformers on long sequences (entire code repositories, collections of related documents) is bottlenecked by quadratic attention costs. We present Multipole Semantic Attention (MuSe), which accelerates 64k-context pretraining by 36% while matching baseline loss and requiring no architectural changes. MuSe clusters queries and keys separately in representation space, yielding query-specific summaries that substantially outperform spatial blocking at matched sparsity while also enabling drop-in compatibility with existing pretrained models; we validate on Llama 3.1-8B and 3.2-1B without retraining. We pretrain language models up to 1B parameters at 64k context on code and scientific documents, confirming that MuSe preserves quality and long-context utilization during training.
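To make the core idea concrete, here is a minimal sketch of attending to semantic key clusters rather than all keys. This is an illustration under our own assumptions, not the MuSe algorithm itself: we use plain k-means in representation space (the abstract's contrast with spatial blocking), pair each centroid key with the mean value of its cluster, and weight the softmax by cluster size; the function names and the size-weighting scheme are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    # Plain k-means in representation space (not spatial/positional blocking).
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for c in range(k):
            members = X[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    return centroids, assign

def clustered_attention(Q, K, V, k=8):
    # Hypothetical sketch: compress n keys/values into k semantic
    # centroids, then let every query attend only to the k centroids,
    # reducing the score matrix from (n_q, n_k) to (n_q, k).
    centroids, assign = kmeans(K, k)
    counts = np.bincount(assign, minlength=k).astype(float)
    Vc = np.zeros((k, V.shape[1]))
    for c in range(k):
        if counts[c]:
            Vc[c] = V[assign == c].mean(0)
    scores = Q @ centroids.T / np.sqrt(Q.shape[1])
    # Add log cluster size so softmax mass roughly tracks how many
    # original keys each centroid stands in for.
    scores = scores + np.log(np.maximum(counts, 1e-9))
    w = np.exp(scores - scores.max(1, keepdims=True))
    w /= w.sum(1, keepdims=True)
    return w @ Vc
```

With n keys and k clusters, the per-query cost drops from O(n) to O(k); the full method additionally clusters queries, which this sketch omits for brevity.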