The computational challenges of Large Language Model (LLM) inference remain a significant barrier to widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse computation method designed to accelerate the pre-filling stage of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices, namely the A-shape, Vertical-Slash, and Block-Sparse patterns, that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices according to the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce latency in the pre-filling stage of long-context LLMs. Our proposed technique can be applied directly to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and on models including LLaMA-3-1M, GLM-4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces pre-filling inference latency by up to 10x on an A100 while maintaining accuracy. Our code is available at https://aka.ms/MInference.
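To make the Vertical-Slash idea concrete, here is a minimal NumPy sketch (not the paper's optimized GPU kernels): the mask keeps a few vertical key columns plus diagonal "slash" lines at fixed query-key offsets, and attention is computed only where the mask is set. The specific column indices and offsets below are hypothetical illustrations, not values from the paper.

```python
import numpy as np

def vertical_slash_mask(n, vertical_idx, slash_offsets):
    """Boolean causal mask keeping selected vertical key columns and
    diagonal 'slash' lines (constant query-key offsets)."""
    mask = np.zeros((n, n), dtype=bool)
    mask[:, vertical_idx] = True              # vertical stripes
    for off in slash_offsets:                 # slash diagonals
        i = np.arange(off, n)
        mask[i, i - off] = True
    # enforce causality: a query attends only to itself and earlier keys
    mask &= np.tril(np.ones((n, n), dtype=bool))
    return mask

def sparse_attention(q, k, v, mask):
    """Softmax attention with positions outside the mask set to -inf."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
# hypothetical pattern: attend to the first two keys and two diagonals
mask = vertical_slash_mask(n, vertical_idx=[0, 1], slash_offsets=[0, 1])
out = sparse_attention(q, k, v, mask)
print(out.shape)  # (16, 8)
```

A real kernel would never materialize the dense score matrix; it would compute only the masked blocks, which is where the latency savings come from.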