As large language models (LLMs) become increasingly important in various domains, accelerating LLM inference becomes critical. However, the following challenges remain unsolved: (1) Synchronized partial softmax update. The softmax operation requires synchronized updates among partial softmax results, leading to ~20% overhead in the attention computation of LLMs. (2) Under-utilized computation for flat GEMM. The matrices involved in GEMM operations during LLM inference are flat, leading to under-utilized computation and >50% performance loss after padding zeros in previous designs. (3) Performance loss due to static dataflow. Kernel performance in LLM inference depends on varied input data features, hardware configurations, etc. A single, static dataflow can incur a 50.25% performance loss for GEMMs of different shapes. We present FlashDecoding++, a fast LLM inference engine supporting mainstream LLMs and hardware back-ends. To tackle the above challenges, FlashDecoding++ creatively proposes: (1) Asynchronized softmax with unified max value. FlashDecoding++ introduces a unified max value shared across different partial softmax computations to avoid synchronization. (2) Flat GEMM optimization with double buffering. FlashDecoding++ observes that flat GEMMs of different shapes face varied bottlenecks, and introduces techniques such as double buffering accordingly. (3) Heuristic dataflow with hardware resource adaptation. FlashDecoding++ heuristically optimizes the dataflow across different hardware resources, taking input dynamics into account. Owing to the versatility of its optimizations, FlashDecoding++ achieves up to 4.86x and 2.18x speedup on NVIDIA and AMD GPUs, respectively, compared to Hugging Face implementations. FlashDecoding++ also achieves an average 1.37x speedup over state-of-the-art LLM inference engines on mainstream LLMs.
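The unified-max-value idea behind the asynchronized softmax can be illustrated with a minimal NumPy sketch (this is an assumed, simplified model of the technique, not the authors' GPU kernel; the constant `PHI` and the function names are hypothetical). A standard numerically stable partial softmax rescales earlier partial sums whenever a new local maximum appears, which forces synchronization between chunks. If all chunks instead subtract one pre-chosen unified value `PHI`, each chunk's exponent sum is independent and partial results can be combined in any order:

```python
import numpy as np

# Hypothetical unified max value; a real engine would pick it from input
# statistics and fall back to recomputation if exponents risk overflow.
PHI = 8.0

def partial_exp_sums(x, chunk_size, phi=PHI):
    """Per-chunk exponent values and sums with a shared offset phi.

    Each chunk is independent of the others: there is no cross-chunk
    max exchange and no rescaling of earlier partial sums, so chunks
    could in principle run asynchronously on separate thread blocks.
    """
    exps, total = [], 0.0
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]
        e = np.exp(chunk - phi)   # offset by the unified value, not a local max
        exps.append(e)
        total += e.sum()          # partial sums combine in any order
    return np.concatenate(exps), total

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
e, total = partial_exp_sums(x, chunk_size=4)
softmax_async = e / total

# Reference: conventional stable softmax using the true global max.
m = x.max()
softmax_ref = np.exp(x - m) / np.exp(x - m).sum()
```

Mathematically, `exp(x - phi) / sum(exp(x - phi))` equals the softmax for any constant `phi`; the choice of `phi` only affects numerical range, which is why a fixed unified value can replace the synchronized running maximum as long as overflow is guarded against.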