Large language model (LLM) inference performance is increasingly bottlenecked by the memory wall. While GPUs continue to scale raw compute throughput, they struggle to deliver scalable performance for memory-bandwidth-bound workloads. This challenge is amplified by emerging reasoning LLM applications, whose long output sequences, low arithmetic intensity, and tight latency constraints demand significantly higher memory bandwidth. As a result, system utilization drops and energy per inference rises, highlighting the need for a system architecture optimized for scalable memory bandwidth. To address these challenges, we present the Reasoning Processing Unit (RPU), a chiplet-based architecture designed for the modern memory wall. RPU introduces: (1) a Capacity-Optimized High-Bandwidth Memory (HBM-CO) that trades capacity for lower energy and cost; (2) a scalable chiplet architecture with bandwidth-first power and area provisioning; and (3) a decoupled microarchitecture that separates the memory, compute, and communication pipelines to sustain high bandwidth utilization. Simulation results show that RPU achieves up to 45.3x lower latency and 18.6x higher throughput than an H100 system at iso-TDP on Llama3-405B.
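To make the low-arithmetic-intensity claim concrete, the back-of-the-envelope sketch below (not part of the abstract; the H100 SXM figures of ~989 TFLOP/s dense BF16 and ~3.35 TB/s HBM bandwidth are publicly quoted specs used here as assumptions) compares the FLOP-per-byte ratio of a single-token decode matrix-vector product against the accelerator's roofline ridge point.

```python
# Minimal roofline sketch: why single-token decode is memory-bandwidth-bound.
# Hardware numbers are assumed H100 SXM specs, for illustration only.

def arithmetic_intensity_gemv(d_model: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights streamed for one token's matrix-vector product."""
    flops = 2 * d_model * d_model                        # one multiply + one add per weight
    bytes_moved = bytes_per_weight * d_model * d_model   # read the FP16 weight matrix once
    return flops / bytes_moved

peak_flops = 989e12   # assumed dense BF16 throughput (FLOP/s)
peak_bw = 3.35e12     # assumed HBM bandwidth (bytes/s)
ridge_point = peak_flops / peak_bw  # intensity needed before compute becomes the limit

ai = arithmetic_intensity_gemv(d_model=16384)  # Llama3-405B hidden size
print(f"decode GEMV intensity: {ai:.1f} FLOP/byte, ridge point: {ridge_point:.0f} FLOP/byte")
# ~1 FLOP/byte vs. ~295 FLOP/byte: decode sits far below the ridge point,
# so latency and throughput are set by memory bandwidth, not compute.
```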