Large language model (LLM) inference performance is increasingly bottlenecked by the memory wall. While GPUs continue to scale raw compute throughput, they struggle to deliver scalable performance for memory-bandwidth-bound workloads. This challenge is amplified by emerging reasoning LLM applications, where long output sequences, low arithmetic intensity, and tight latency constraints demand significantly higher memory bandwidth. As a result, system utilization drops and energy per inference rises, highlighting the need for a system architecture optimized for scalable memory bandwidth. To address these challenges, we present the Reasoning Processing Unit (RPU), a chiplet-based architecture designed to overcome the modern memory wall. RPU introduces: (1) a Capacity-Optimized High-Bandwidth Memory (HBM-CO) that trades capacity for lower energy and cost; (2) a scalable chiplet architecture featuring a bandwidth-first power and area provisioning design; and (3) a decoupled microarchitecture that separates memory, compute, and communication pipelines to sustain high bandwidth utilization. Simulation results show that RPU achieves up to 45.3x lower latency and 18.6x higher throughput over an H100 system at iso-TDP on Llama3-405B.
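The claim that low arithmetic intensity makes reasoning-style decode memory-bandwidth bound can be illustrated with a simple roofline estimate. The sketch below is not from the paper: the GEMV intensity model and the batch sizes are illustrative assumptions, and the hardware numbers are the publicly stated H100 SXM FP16 Tensor Core peak and HBM3 bandwidth.

```python
# Roofline-style sketch of why small-batch LLM decode is memory-bandwidth bound.
# Hardware numbers are public H100 SXM specs; the rest is an illustrative model.

def arithmetic_intensity_gemv(batch: int, bytes_per_weight: float = 2.0) -> float:
    """FLOPs per byte of weight traffic for a batched matrix-vector product.

    Each weight element is read from memory once and contributes one
    multiply-add (2 FLOPs) per sequence in the batch.
    """
    return 2.0 * batch / bytes_per_weight

peak_flops = 989e12            # H100 SXM dense FP16 Tensor Core peak, FLOP/s
peak_bw = 3.35e12              # H100 SXM HBM3 bandwidth, bytes/s
ridge = peak_flops / peak_bw   # intensity at which the kernel becomes compute bound

for batch in (1, 8, 64, 512):
    ai = arithmetic_intensity_gemv(batch)
    bound = "compute" if ai >= ridge else "memory"
    print(f"batch={batch:4d}  intensity={ai:6.1f} FLOP/B  -> {bound}-bound")
```

With FP16 weights, batch-1 decode sits at roughly 1 FLOP per byte, far below the H100 ridge point of roughly 295 FLOP/B, so the GPU's compute units sit idle waiting on HBM; only very large batches cross into the compute-bound regime, which is exactly where tight per-token latency constraints prevent reasoning workloads from operating.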