As large language models (LLMs) become increasingly important in various domains, accelerating LLM inference becomes critical. However, the following challenges remain unsolved: (1) Synchronized partial softmax update. The softmax operation requires synchronized updates among partial softmax results, leading to ~20% overhead in the attention computation of LLMs. (2) Under-utilized computation for flat GEMM. The matrices involved in GEMM operations during LLM inference are flat, leading to under-utilized computation and >50% performance loss after padding zeros in previous designs. (3) Performance loss due to static dataflow. Kernel performance in LLM inference depends on varied input data features, hardware configurations, etc. A single, static dataflow can incur a 50.25% performance loss for GEMMs of different shapes. We present FlashDecoding++, a fast LLM inference engine supporting mainstream LLMs and hardware back-ends. To tackle the above challenges, FlashDecoding++ creatively proposes: (1) Asynchronized softmax with unified max value. FlashDecoding++ introduces a unified max value shared across different partial softmax computations to avoid synchronization. (2) Flat GEMM optimization with double buffering. FlashDecoding++ observes that flat GEMMs of different shapes face varied bottlenecks, and introduces techniques such as double buffering accordingly. (3) Heuristic dataflow with hardware resource adaptation. FlashDecoding++ heuristically optimizes the dataflow across different hardware resources, taking input dynamics into account. Owing to the versatility of its optimizations, FlashDecoding++ achieves up to 4.86x and 2.18x speedup on NVIDIA and AMD GPUs, respectively, compared to Hugging Face implementations. FlashDecoding++ also achieves an average 1.37x speedup over state-of-the-art LLM inference engines on mainstream LLMs.
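The unified-max-value idea behind the asynchronized softmax can be illustrated with a minimal NumPy sketch (this is an assumed, simplified model of the technique, not the authors' GPU kernel; the constant `PHI` and the function names are hypothetical). A standard numerically stable partial softmax rescales earlier partial sums whenever a new local maximum appears, which forces synchronization between chunks. If all chunks instead subtract one pre-chosen unified value `PHI`, each chunk's exponent sum is independent and partial results can be combined in any order:

```python
import numpy as np

# Hypothetical unified max value; a real engine would pick it from input
# statistics and fall back to recomputation if exponents risk overflow.
PHI = 8.0

def partial_exp_sums(x, chunk_size, phi=PHI):
    """Per-chunk exponent values and sums with a shared offset phi.

    Each chunk is independent of the others: there is no cross-chunk
    max exchange and no rescaling of earlier partial sums, so chunks
    could in principle run asynchronously on separate thread blocks.
    """
    exps, total = [], 0.0
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]
        e = np.exp(chunk - phi)   # offset by the unified value, not a local max
        exps.append(e)
        total += e.sum()          # partial sums combine in any order
    return np.concatenate(exps), total

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
e, total = partial_exp_sums(x, chunk_size=4)
softmax_async = e / total

# Reference: conventional stable softmax using the true global max.
m = x.max()
softmax_ref = np.exp(x - m) / np.exp(x - m).sum()
```

Mathematically, `exp(x - phi) / sum(exp(x - phi))` equals the softmax for any constant `phi`; the choice of `phi` only affects numerical range, which is why a fixed unified value can replace the synchronized running maximum as long as overflow is guarded against.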