Large language models (LLMs) have been widely deployed for online generative services, where numerous LLM instances jointly handle workloads with fluctuating request arrival rates and variable request lengths. To efficiently execute the coexisting compute-intensive and memory-intensive operators, computing paradigms based on near-memory processing (NMP) have been extensively proposed. However, existing NMP designs adopt coarse-grained KV cache management and an inflexible attention execution flow, which prevents them from efficiently handling \textit{highly dynamic} LLM serving workloads and limits their ability to accelerate LLM serving. To tackle these problems, we propose Helios, a Hybrid-bonding-based \uline{L}LM \uline{S}erving accelerator. Helios bridges the fundamental gap between the dynamic nature of KV cache management in LLM serving and the distributed, non-uniform memory abstraction among NMP processing engines (PEs). To this end, we design both the intra-PE execution flow and the inter-PE communication primitives for distributed tiled attention execution. We further propose a \textit{spatially-aware} KV cache allocation mechanism that balances the attention workload across PEs while minimizing inter-PE data transfer overhead. Compared with existing GPU/NMP designs, Helios achieves a 3.25$\times$ (geomean) speedup and 3.36$\times$ (geomean) better energy efficiency, and reduces P50/P99 time-between-tokens by up to 72%/76%.
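The \textit{spatially-aware} KV cache allocation described above can be illustrated with a minimal sketch. Everything here is a hedged assumption for illustration: the greedy cost function, the hop-distance matrix, and the `allocate_block` name are inventions of this sketch, not Helios's actual policy; it only conveys the stated trade-off between balancing attention workload across PEs and minimizing inter-PE data transfer.

```python
# Hypothetical sketch of spatially-aware KV cache block allocation.
# The cost model and all names are illustrative assumptions, not the
# paper's actual mechanism.

def allocate_block(loads, hops, home_pe, alpha=1.0, beta=1.0):
    """Pick a PE for a newly appended KV cache block.

    loads   -- current number of KV blocks held by each PE
    hops    -- hops[i][j]: inter-PE hop distance between PE i and PE j
    home_pe -- PE holding the sequence's earlier blocks (locality anchor)

    The cost trades off load balance (weight alpha) against the
    inter-PE transfer distance incurred during distributed tiled
    attention (weight beta); the lowest-cost PE wins.
    """
    costs = [alpha * loads[pe] + beta * hops[home_pe][pe]
             for pe in range(len(loads))]
    return min(range(len(loads)), key=costs.__getitem__)

# Toy 4-PE example: PEs arranged in a 1-D chain, sequence anchored at PE 0.
hops = [[abs(i - j) for j in range(4)] for i in range(4)]
loads = [3, 1, 1, 4]
pe = allocate_block(loads, hops, home_pe=0)  # PE 1: lightly loaded, 1 hop away
loads[pe] += 1
```

With these toy numbers the allocator skips the heavily loaded PEs (0 and 3) and the farther idle PE 2, picking PE 1; tuning `alpha`/`beta` shifts the balance between load and locality.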