Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI trends, its primary bottlenecks are memory and interconnect rather than compute. To address these challenges, we highlight four architecture research opportunities: High Bandwidth Flash for 10X memory capacity with HBM-like bandwidth; Processing-Near-Memory and 3D memory-logic stacking for high memory bandwidth; and low-latency interconnect to speed up communication. While our focus is datacenter AI, we also review the applicability of these techniques to mobile devices.