Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in state-of-the-art LLMs such as DeepSeek-V3 and Kimi K2. Thanks to its novel formulation, MLA admits two functionally equivalent but computationally distinct kernel implementations: naive and absorb. While naive kernels (e.g., FlashAttention) are typically preferred in training and prefill for their computational efficiency, existing decoding kernels (e.g., FlashMLA) rely on the absorb method to minimize HBM bandwidth usage. However, the compute-bound nature of absorb implementations prevents them from benefiting from data-reuse opportunities in attention calculations, such as shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that combines the naive and absorb formulations to harness the strengths of both. TyphoonMLA effectively exploits the shared prefix by applying the naive formulation to the compute-bound parts of the attention calculation, while reducing bandwidth requirements for the non-shared parts by using the absorb formulation. As a result, TyphoonMLA improves the throughput of attention calculations in MLA architectures by up to 3x on NPUs and 3.24x on GPUs, with only a 3% overhead in HBM capacity.
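The hybrid idea above can be illustrated with a minimal single-head sketch (a simplification, not the paper's kernels: head names, dimensions, and the omission of RoPE and multi-head batching are all assumptions). The shared prefix is processed with the naive formulation (decompress the latent cache into keys/values, then attend), the non-shared suffix with the absorb formulation (fold the key up-projection into the query and attend directly over the latent cache), and the two partial results are combined with the standard log-sum-exp softmax merge:

```python
import numpy as np

rng = np.random.default_rng(0)
d_c, d_h = 8, 16                                 # latent (compressed) dim, head dim
W_uk = rng.standard_normal((d_c, d_h)) * 0.1     # key up-projection
W_uv = rng.standard_normal((d_c, d_h)) * 0.1     # value up-projection

def partial_attn(q, K, V):
    """Softmax attention over one chunk; returns output plus the
    (max, normalizer) statistics needed to merge chunks later."""
    s = q @ K.T / np.sqrt(d_h)
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    l = p.sum(axis=-1, keepdims=True)
    return (p / l) @ V, m, l

def merge(o1, m1, l1, o2, m2, l2):
    """Log-sum-exp merge of two partial softmax results (as in
    FlashAttention-style chunked attention)."""
    m = np.maximum(m1, m2)
    w1, w2 = l1 * np.exp(m1 - m), l2 * np.exp(m2 - m)
    return (o1 * w1 + o2 * w2) / (w1 + w2)

# Latent KV cache: first n_s tokens are a prefix shared across requests.
n_s, n_n = 12, 5
C = rng.standard_normal((n_s + n_n, d_c))
q = rng.standard_normal((2, d_h))                # two decoding queries

# Naive path (compute-bound, amortized over the shared prefix):
# decompress latents into full keys/values, then attend.
K_s, V_s = C[:n_s] @ W_uk, C[:n_s] @ W_uv
o1, m1, l1 = partial_attn(q, K_s, V_s)

# Absorb path (bandwidth-friendly, per-request suffix):
# absorb W_uk into the query and attend over the latent cache itself.
q_abs = q @ W_uk.T
o_lat, m2, l2 = partial_attn(q_abs, C[n_s:], C[n_s:])
o2 = o_lat @ W_uv                                # up-project the latent output

o_hybrid = merge(o1, m1, l1, o2, m2, l2)

# Reference: plain naive attention over the full sequence.
o_full, _, _ = partial_attn(q, C @ W_uk, C @ W_uv)
print(np.allclose(o_hybrid, o_full))             # the two paths agree
```

The merge is exact because both paths compute the same scores (`q @ (C W_uk).T == (q @ W_uk.T) @ C.T`), so the split changes only where the compute and bandwidth costs land, not the result.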