Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI trends, the primary challenges are memory and interconnect rather than compute. To address these challenges, we highlight four architecture research opportunities: High Bandwidth Flash for 10X memory capacity with HBM-like bandwidth; Processing-Near-Memory and 3D memory-logic stacking for high memory bandwidth; and low-latency interconnect to speed up communication. While our focus is datacenter AI, we also review their applicability to mobile devices.