Large language models (LLMs) have demonstrated remarkable proficiency across a wide range of natural language processing applications. However, the high energy and latency overhead induced by the KV cache limits edge deployment, especially for long contexts. Emerging hybrid bonding (HB) technology has been proposed as a promising alternative to conventional near-memory processing (NMP) architectures, offering improved bandwidth efficiency and lower power consumption while exhibiting the characteristics of distributed memory. In this paper, we propose H2EAL, a hybrid bonding-based accelerator with sparse attention algorithm-hardware co-design for efficient LLM inference at the edge. At the algorithm level, we propose a hybrid sparse attention scheme that applies static and dynamic sparsity to different heads, fully exploiting sparsity while maintaining high accuracy. At the hardware level, we co-design the hardware to support hybrid sparse attention and propose memory-compute co-placement to address the distributed-memory bottleneck. Since different attention heads exhibit different sparse patterns and the attention structure often mismatches the HB architecture, we further develop a load-balancing scheduler with parallel tiled attention to resolve workload imbalance and optimize the mapping strategy. Extensive experiments demonstrate that H2EAL achieves 5.20~48.21x speedup and 6.22~73.48x energy efficiency improvement over the baseline HB implementation, with a negligible average accuracy drop of 0.87% across multiple benchmarks.
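To make the per-head hybrid sparsity concrete, the following is a minimal, hypothetical sketch of decode-time sparse attention in which "static" heads attend only to a few initial sink tokens plus a recent local window, while "dynamic" heads select the top-k highest-scoring cached keys per query. The head assignment, window size, and k are illustrative placeholders and not the configuration used by H2EAL.

    # Hypothetical sketch of per-head hybrid sparse attention during decoding.
    # static_heads, num_sink, window, and top_k are assumed parameters for illustration.
    import torch
    import torch.nn.functional as F

    def hybrid_sparse_decode_attention(q, k_cache, v_cache, static_heads,
                                       num_sink=4, window=64, top_k=128):
        """q: [H, d] single-token queries; k_cache, v_cache: [H, T, d] per-head KV cache."""
        H, T, d = k_cache.shape
        out = torch.empty(H, d)
        scale = d ** -0.5
        for h in range(H):
            scores = (k_cache[h] @ q[h]) * scale            # [T] attention logits
            mask = torch.full((T,), float("-inf"))
            if h in static_heads:
                # Static pattern: keep sink tokens and the most recent window.
                mask[:min(num_sink, T)] = 0.0
                mask[max(0, T - window):] = 0.0
            else:
                # Dynamic pattern: keep only the top-k scoring cache entries.
                idx = torch.topk(scores, k=min(top_k, T)).indices
                mask[idx] = 0.0
            probs = F.softmax(scores + mask, dim=-1)        # sparse attention weights
            out[h] = probs @ v_cache[h]
        return out

In this sketch, static heads read a fixed, predictable subset of the KV cache (amenable to static placement across HB memory banks), whereas dynamic heads require a per-query selection step, which is what motivates the load-balancing scheduler described above.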