All current LLM serving systems place the GPU at the center, from production attention-FFN disaggregation deployments to NVIDIA's Rubin GPU-LPU heterogeneous platform. Even academic PIM/PNM proposals still treat the GPU as the central hub for cross-device communication. Yet the GPU's compute-rich architecture is fundamentally mismatched with the memory-bound nature of decode-phase attention, inflating serving latency while wasting power and die area on idle compute units. The mismatch worsens as reasoning and agentic workloads push context lengths toward one million tokens, making attention latency the primary user-facing bottleneck. To address these inefficiencies, we present AMMA, a multi-chiplet, memory-centric architecture for low-latency long-context attention. AMMA replaces GPU compute dies with HBM-PNM cubes, roughly doubling the available memory bandwidth to better serve memory-bound attention workloads. To translate this bandwidth into proportional performance gains, we introduce (i) a logic-die microarchitecture that fully exploits each cube's internal bandwidth for decode attention under a minimal power and area budget, (ii) a two-level hybrid parallelism scheme, and (iii) a reordered collective flow that reduces intra-chip die-to-die communication overhead. We further conduct a design-space exploration over per-cube compute capability and intra-chip D2D link bandwidth, offering actionable guidance for hardware designers. Evaluations show that AMMA achieves 15.5× lower attention latency and 6.9× lower energy consumption than the NVIDIA H100.