The widespread adoption of Large Language Models (LLMs) has driven an exponential increase in demand for efficient serving systems. As request volumes and context lengths grow, key-value (KV)-related operations, including attention computation and KV cache storage, have emerged as critical bottlenecks, demanding massive memory bandwidth and capacity. Unfortunately, existing LLM serving systems, optimized for compute-bound workloads, handle these memory-intensive operations poorly. Even with Processing-In-Memory (PIM) technology, current single-level memory designs cannot satisfy the bandwidth and capacity requirements simultaneously. To address these challenges, we propose Processing Across Memory (PAM), a KV-centric LLM serving system that coordinates heterogeneous PIM-enabled memory devices within a hierarchical architecture. PAM introduces a novel computing paradigm that balances high memory bandwidth with scalable capacity. First, PAM exploits the inherent context locality in KV access patterns to distribute KV tokens intelligently across the memory hierarchy. Second, to further exploit this locality, it introduces the PAMattention algorithm, which enables fine-grained parallel attention computation across heterogeneous PIM devices. Finally, PAM incorporates an intra-device KV mapping scheme, an inter-device KV migration interface, and an online inter-device KV scheduling algorithm to dynamically balance computational workloads. By addressing bandwidth and capacity demands simultaneously, PAM significantly enhances the efficiency and scalability of LLM serving, paving the way for cost-effective, high-performance solutions in the era of large-scale AI.
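The context-locality-driven KV placement described above can be sketched as a simple two-tier policy: recently generated tokens, which attention accesses most often, stay in a small high-bandwidth PIM tier, while older tokens spill to a large capacity tier. This is a minimal illustrative sketch; the class and tier names (`TieredKVCache`, fast/slow) are assumptions for exposition, not the paper's actual interface.

```python
from collections import deque

class TieredKVCache:
    """Hypothetical sketch of context-locality-based KV placement
    across a two-tier PIM memory hierarchy."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = deque()   # high-bandwidth PIM tier (small)
        self.slow = []        # capacity PIM tier (large)

    def append(self, token_kv):
        # New KV tokens land in the fast tier; attention during decoding
        # tends to touch recent context, so keeping it close pays off.
        self.fast.append(token_kv)
        # Only the oldest tokens migrate to the capacity tier.
        while len(self.fast) > self.fast_capacity:
            self.slow.append(self.fast.popleft())

cache = TieredKVCache(fast_capacity=4)
for t in range(10):
    cache.append(f"kv{t}")
# The four most recent tokens remain in the fast tier; kv0..kv5 spilled.
```

A real system would migrate tokens based on measured access patterns and device load rather than pure recency, which is what PAM's online KV scheduling algorithm addresses.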