Edge acceleration for large language models is crucial for their widespread application; however, achieving fast attention inference and efficient decoding on resource-constrained edge accelerators remains challenging. This paper presents SwiftKV Attention, a per-token pipelined, low-latency, single-pass attention inference algorithm in which every (kt, vt) pair in the KV cache is processed exactly once in a uniform per-token pipeline, without score materialization, blockwise softmax, or a second pass, thereby enabling fast execution on edge accelerators with a single hardware set and no resource-intensive parallelism. Furthermore, to address the limited support for multi-head LLM decoding in existing accelerators, we design the SwiftKV-MHA accelerator, which performs high-precision attention and low-precision GEMV on the same processor array, achieving fast and efficient multi-head parallel decoding. Experimental results show that, on the edge accelerator, the SwiftKV Attention algorithm achieves a 7.16× speedup over native attention and significantly outperforms other attention algorithms. SwiftKV-MHA further reduces attention latency by 13.48×; under the same settings, it improves generation speed by 17.4% and increases token efficiency by 1.98× compared with state-of-the-art works.
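The single-pass, no-second-pass property described above can be illustrated with the well-known online-softmax formulation of attention decoding, in which a running maximum, normalizer, and weighted-value accumulator are updated as each cached (kt, vt) pair streams through once. The sketch below is a minimal NumPy illustration of that general technique, not the paper's exact SwiftKV pipeline; all names are illustrative.

```python
import numpy as np

def single_pass_attention(q, K, V):
    """Attention decode that visits each cached (k_t, v_t) exactly once.

    Maintains a running score maximum `m`, softmax normalizer `s`, and
    weighted-value accumulator `acc` (online softmax), so no score vector
    is materialized and no second pass over the KV cache is required.
    Illustrative sketch only, not the SwiftKV implementation.
    """
    d = q.shape[0]
    m = -np.inf                                   # running max of scores
    s = 0.0                                       # running normalizer
    acc = np.zeros(V.shape[1], dtype=np.float64)  # running weighted sum
    for k_t, v_t in zip(K, V):
        score = float(q @ k_t) / np.sqrt(d)
        m_new = max(m, score)
        # Rescale previous partial sums when the running max changes.
        scale = np.exp(m - m_new) if m != -np.inf else 0.0
        w = np.exp(score - m_new)
        s = s * scale + w
        acc = acc * scale + w * v_t
        m = m_new
    return acc / s
```

Because each (kt, vt) is consumed in a uniform per-token update, the loop maps naturally onto a single pipelined hardware unit, which is the execution pattern the abstract attributes to SwiftKV Attention.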