MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation

Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes either via compression, which lowers fidelity, or via selection/eviction, which restricts what remains accessible; both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing attention computations from semantically similar recent queries. It has three stages: a match stage performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), compared to the latest FlashInfer library, MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves attention-phase speedups of over 14.3x and end-to-end speedups of up to 2.6x, while maintaining full-attention quality. By reusing computation, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available here: https://github.com/YJHMITWEB/MAC-Attention.git
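
The abstract does not spell out how the numerically stable merge is performed. One standard way to fuse attention results computed over disjoint KV segments is to carry each segment's log-sum-exp alongside its output and reweight accordingly; the sketch below illustrates that general technique and is not taken from the MAC-Attention code. The function names `partial_attention` and `merge_partials` are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def partial_attention(q, keys, values):
    """Softmax attention of one query over a single KV segment.

    Returns the segment's output vector and the log-sum-exp (LSE) of its
    scaled scores, which is all that is needed to merge it later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max before exponentiating for stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(dim)]
    return out, m + math.log(denom)


def merge_partials(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention results over disjoint KV segments.

    Each output is reweighted by its share of the total softmax mass,
    computed relative to the larger LSE so the exponentials cannot overflow.
    """
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)
    w_b = math.exp(lse_b - m)
    total = w_a + w_b
    merged = [(w_a * a + w_b * b) / total for a, b in zip(out_a, out_b)]
    return merged, m + math.log(total)
```

Under these assumptions, merging the result for a reused prefix with the result for the KV tail reproduces attention over the concatenated segments up to floating-point rounding, which is the property a fidelity-preserving scheme depends on.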
