MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs

Haoran Wu, Zeyu Cao, Yao Lai, Binglei Lou, Jiayi Nie, Can Xiao, Timi Adeniran, Przemyslaw Forys, Kauser Johar, Catriona Wright, Junyi Liu, Kai Shi, Nicholas D. Lane, Rika Antonova, Jianyi Cheng, Timothy Jones, Aaron Zhao, Robert Mullins

Emerging agentic LLM workloads are driving rapidly growing demand on both memory capacity and bandwidth, with different phases of inference (e.g., prefill and decode) imposing distinct requirements. Industry is responding by composing heterogeneous accelerators into single interconnected systems, as exemplified by NVIDIA's Vera Rubin platform, where each device brings its own memory architecture. This heterogeneity is further compounded by a widening landscape of available memory technologies: high-density on-chip SRAM, HBM, LPDDR, GDDR, and emerging options such as high-bandwidth flash (HBF), each offering different capacity, bandwidth, and power trade-offs. Identifying the right memory architecture for next-generation inference accelerators requires navigating a vast and rapidly evolving design space, in which the interplay between workload characteristics, NPU design dimensions, and memory system design remains largely underexplored. To address this challenge, we present MemExplorer, a new memory system synthesizer for heterogeneous NPU systems. MemExplorer provides a unified abstraction for modeling diverse memory technologies across different hierarchy levels (e.g., on-chip and off-chip) and automatically determines an efficient heterogeneous memory system together with NPU design choices (e.g., matrix engine size) to balance throughput and power between prefilling and decoding devices in a multi-device NPU system. Experimental results show that, under the same power budget for agentic workloads, MemExplorer achieves up to 2.3x higher energy efficiency than the baseline NPU and 3.23x higher than H100 in the prefill-only setting. Under equivalent performance targets in the decode setting, it further delivers up to 1.93x and 2.72x higher power efficiency over the baseline NPU and H100, respectively.
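To make the idea of a unified memory abstraction plus automated selection concrete, the following is a minimal sketch and not MemExplorer's actual algorithm or interface. It assumes each memory technology can be summarized by a per-unit capacity, bandwidth, and power, and it exhaustively searches unit counts for the lowest-power mix that meets a capacity and bandwidth requirement. The names `MemoryTech`, `TECHS`, and `cheapest_mix` are hypothetical, and every number is a placeholder chosen for illustration rather than a vendor figure.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MemoryTech:
    """Hypothetical per-unit summary of one memory technology."""
    name: str
    capacity_gb: int     # capacity contributed by one unit
    bandwidth_gbs: int   # bandwidth contributed by one unit
    power_w: int         # power drawn by one unit


# Placeholder values for illustration only; these are NOT real device specs.
TECHS = [
    MemoryTech("HBM", 24, 800, 30),
    MemoryTech("LPDDR", 32, 100, 4),
    MemoryTech("GDDR", 8, 200, 10),
]


def cheapest_mix(techs, need_capacity_gb, need_bandwidth_gbs, max_units=4):
    """Brute-force search over unit counts (0..max_units per technology).

    Returns (total_power_w, {tech_name: units}) for the lowest-power mix
    that satisfies both requirements, or None if no mix is feasible.
    """
    best = None
    for counts in product(range(max_units + 1), repeat=len(techs)):
        capacity = sum(c * t.capacity_gb for c, t in zip(counts, techs))
        bandwidth = sum(c * t.bandwidth_gbs for c, t in zip(counts, techs))
        power = sum(c * t.power_w for c, t in zip(counts, techs))
        if capacity < need_capacity_gb or bandwidth < need_bandwidth_gbs:
            continue  # this mix misses a requirement
        if best is None or power < best[0]:
            best = (power, {t.name: c for t, c in zip(techs, counts)})
    return best


# A capacity-heavy, bandwidth-light requirement versus a bandwidth-heavy one.
capacity_bound = cheapest_mix(TECHS, need_capacity_gb=64, need_bandwidth_gbs=200)
bandwidth_bound = cheapest_mix(TECHS, need_capacity_gb=48, need_bandwidth_gbs=1600)
print(capacity_bound)
print(bandwidth_bound)
```

With these placeholder values, the capacity-heavy requirement is met most cheaply by the low-power, high-capacity technology alone, while the bandwidth-heavy requirement pushes the search toward the high-bandwidth technology. A real synthesizer would also have to model hierarchy levels, NPU parameters such as matrix engine size, and workload phases, none of which this sketch attempts.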
