Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Recent AI trends exacerbate this shift: the primary challenges are memory and interconnect rather than compute. To address these challenges, we highlight four architecture research opportunities: High Bandwidth Flash for 10X memory capacity with HBM-like bandwidth; Processing-Near-Memory and 3D memory-logic stacking for high memory bandwidth; and low-latency interconnect to speed up communication. While our focus is datacenter AI, we also review the applicability of these technologies to mobile devices.
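To make the memory-bound nature of autoregressive Decode concrete, the following back-of-the-envelope roofline sketch compares the arithmetic intensity of single-batch token generation against an accelerator's ridge point. All numbers (a hypothetical 70B-parameter fp16 model and an HBM-class accelerator with 1 PFLOP/s peak compute and 3 TB/s bandwidth) are illustrative assumptions, not figures from this paper.

```python
# Roofline sketch: why the Decode phase is memory-bandwidth-bound.
# Hypothetical model and accelerator parameters, for illustration only.

params = 70e9                  # model parameters (assumed 70B)
bytes_per_param = 2            # fp16 weights

# At batch size 1, generating one token requires reading every weight once
# and performs roughly 2 FLOPs (multiply + add) per parameter.
flops_per_token = 2 * params
weight_bytes_per_token = params * bytes_per_param

# Arithmetic intensity: FLOPs performed per byte moved from memory.
intensity = flops_per_token / weight_bytes_per_token   # = 1.0 FLOP/byte

# Assumed HBM-class accelerator.
peak_flops = 1e15              # 1 PFLOP/s peak fp16 compute
mem_bw = 3e12                  # 3 TB/s memory bandwidth
ridge_point = peak_flops / mem_bw  # intensity needed to become compute-bound

print(f"arithmetic intensity: {intensity:.1f} FLOPs/byte")
print(f"ridge point:          {ridge_point:.0f} FLOPs/byte")
print("memory-bound" if intensity < ridge_point else "compute-bound")
```

Under these assumptions, Decode lands two orders of magnitude below the ridge point, which is why memory capacity and bandwidth, rather than peak compute, dominate the opportunities listed above.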