Long contexts improve the capabilities of large language models but pose serious hardware challenges: compute and memory footprints grow linearly with sequence length. In particular, the decoding phase continuously accesses a massive KV cache, which dramatically increases bandwidth and compute pressure. Existing accelerators are primarily designed and evaluated for short contexts and suffer significant performance degradation on long ones. To bridge this gap, we identify the major bottleneck and present a hardware accelerator for long-context attention decoding, built through hardware-software co-design. On the software side, we propose dual-compression dynamic sparse attention, which combines ultra-low-precision quantization with feature sparsity to minimize prediction overhead. A hardware-friendly approximate Top-K selection further reduces filter complexity from $O(n \log k)$ to $O(n)$. On the hardware side, we deeply optimize compute and memory access to tackle the bottlenecks that arise from the intricate interplay between sparse attention and long contexts, and we establish a performance model to derive the optimal co-design scheme. The resulting hardware adopts a fully pipelined parallel architecture and achieves $O(n)$ efficiency even for long sequences. Experiments show that our design delivers a $3.82\times$ speedup and $74.19\times$ higher energy efficiency over the A100 GPU. It is the first ASIC accelerator that efficiently supports long-context inference, with at least $3.5\times$ higher throughput and $2.08\times$ better energy efficiency than SOTA accelerators.
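To make the $O(n \log k)$ to $O(n)$ claim concrete, the following is a minimal sketch of one way a linear-time approximate Top-K filter can work. The abstract does not say how the proposed selection is implemented, so the histogram-bucketing approach here is an assumption chosen for illustration, and the names `approx_top_k` and `num_buckets` are hypothetical. The idea is to replace the heap used by exact Top-K with a fixed number of score buckets, so that every pass over the scores does constant work per element.

```python
def approx_top_k(scores, k, num_buckets=64):
    """Return k indices whose scores are approximately the k largest.

    Illustrative sketch only (not the paper's method): scores are binned
    into a fixed number of buckets, so the cost is O(n + num_buckets)
    instead of the O(n log k) of a heap-based exact Top-K.
    """
    n = len(scores)
    if k <= 0:
        return []
    if k >= n:
        return list(range(n))

    lo, hi = min(scores), max(scores)           # one O(n) pass each
    if hi == lo:                                # all scores equal: any k will do
        return list(range(k))

    # Map every score to a bucket index in [0, num_buckets - 1].
    scale = (num_buckets - 1) / (hi - lo)
    buckets = [min(num_buckets - 1, int((s - lo) * scale)) for s in scores]

    # Histogram of bucket occupancy.
    counts = [0] * num_buckets
    for b in buckets:
        counts[b] += 1

    # Walk down from the highest bucket until k elements are covered.
    remaining = k
    cut = num_buckets - 1
    while counts[cut] < remaining:
        remaining -= counts[cut]
        cut -= 1

    # Keep everything above the cut bucket, then fill the rest from the
    # cut bucket in index order. This fill step is the approximation:
    # elements inside one bucket are not ranked against each other.
    selected = []
    for i, b in enumerate(buckets):
        if b > cut:
            selected.append(i)
        elif b == cut and remaining > 0:
            selected.append(i)
            remaining -= 1
    return selected


if __name__ == "__main__":
    demo_scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7]
    print(approx_top_k(demo_scores, 3))
```

With a fixed bucket count, each step is a single linear scan with no data-dependent sorting, which is the kind of structure that maps naturally onto a pipelined datapath. The trade-off is that ties within the cut bucket are broken arbitrarily, so the result can differ from exact Top-K near the selection threshold.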