All current LLM serving systems place the GPU at the center, from production-level attention-FFN disaggregation to NVIDIA's Rubin GPU-LPU heterogeneous platform. Even academic PIM/PNM proposals still treat the GPU as the central hub for cross-device communication. Yet the GPU's compute-rich architecture is fundamentally mismatched with the memory-bound nature of decode-phase attention, inflating serving latency while wasting power and die area on idle compute units. The problem is compounded as reasoning and agentic workloads push context lengths toward one million tokens, making attention latency the primary user-facing bottleneck. To address these inefficiencies, we present AMMA, a multi-chiplet, memory-centric architecture for low-latency long-context attention. AMMA replaces GPU compute dies with HBM-PNM cubes, roughly doubling the available memory bandwidth to better serve memory-bound attention workloads. To translate this bandwidth into proportional performance gains, we introduce (i) a logic-die microarchitecture that fully exploits per-cube internal bandwidth for decode attention under a minimal power and area budget, (ii) a two-level hybrid parallelism scheme, and (iii) a reordered collective flow that reduces intra-chip die-to-die communication overhead. We further conduct a design-space exploration over per-cube compute power and intra-chip D2D link bandwidth, providing actionable guidance for hardware designers. Evaluations show that AMMA achieves 15.5× lower attention latency and 6.9× lower energy consumption than the NVIDIA H100.