Existing LLM training and inference frameworks struggle to boost efficiency with sparsity while maintaining the integrity of context and model architecture. Inspired by the sharding concept in databases and the fact that attention parallelizes over heads on accelerators, we propose Sparsely-Sharded (S2) Attention, an attention algorithm that allocates heterogeneous context partitions to different attention heads to divide and conquer. S2-Attention restricts each attention head to attend only to a partition of the context following a strided sparsity pattern, while the full context is preserved as the union of all the shards. Because attention heads are processed in separate thread blocks, the context reduction for each head translates into end-to-end speed-up and memory reduction. At inference, LLMs trained with S2-Attention get the KV-cache reduction for free, with model quality preserved. In experiments, we show S2-Attention can provide (1) as much as 25.3X wall-clock attention speed-up over FlashAttention-2, resulting in a 6X reduction in end-to-end training time and a 10X reduction in inference latency, (2) on-par model training quality compared to default attention, and (3) perfect needle-retrieval accuracy over a 32K context window. On top of the algorithm, we build DKernel, an LLM training and inference kernel library that lets users customize sparsity patterns for their own models. We open-source DKernel and make it compatible with Megatron, PyTorch, and vLLM.
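To make the sharding idea concrete, here is a toy sketch of a per-head strided attention mask in which each head sees only its own context shard, yet the union of all heads' shards recovers the full causal context. This is an illustration only, not the actual S2-Attention kernel: the function name `s2_head_mask`, the block size, and the inclusion of each query's local block are assumptions for the example.

```python
import numpy as np

def s2_head_mask(seq_len, num_heads, head, block=4):
    # Hypothetical illustration: head `head` attends only to key blocks whose
    # block index is congruent to `head` modulo num_heads (its strided shard),
    # plus the query's own local block, all under a causal constraint.
    q = np.arange(seq_len)[:, None]   # query positions
    k = np.arange(seq_len)[None, :]   # key positions
    causal = k <= q
    strided = (k // block) % num_heads == head   # this head's context shard
    local = (k // block) == (q // block)         # keep the local block
    return causal & (strided | local)

masks = [s2_head_mask(16, 4, h) for h in range(4)]
# Union of all heads' shards recovers the full causal context.
union = np.logical_or.reduce(masks)
full_causal = np.tril(np.ones((16, 16), dtype=bool))
assert (union == full_causal).all()
```

Each head computes attention over a strictly smaller key set (fewer entries per row of the mask), which is what produces the per-head compute and KV-cache savings, while no position is globally dropped.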