FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGA

Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang

From arXiv; accepted to FPGA '24

Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from heavy computation and memory overheads. Compression techniques such as sparsification and quantization are commonly used to narrow the gap between LLMs' computation/memory overheads and hardware capacity. However, existing GPUs and transformer-based accelerators cannot efficiently process compressed LLMs, due to three unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overheads. This paper proposes FlightLLM, enabling efficient LLM inference with a complete mapping flow on FPGAs. In FlightLLM, we highlight an innovative solution: the computation and memory overheads of LLMs can be addressed by utilizing FPGA-specific resources (e.g., DSP48 and the heterogeneous memory hierarchy). First, we propose a configurable sparse DSP chain to support different sparsity patterns with high computation efficiency. Second, we propose an always-on-chip decode scheme to boost memory bandwidth with mixed-precision support. Finally, to make FlightLLM available for real-world LLMs, we propose a length-adaptive compilation method to reduce the compilation overhead. Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0$\times$ higher energy efficiency and 1.8$\times$ better cost efficiency than commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using vLLM and SmoothQuant at a batch size of one. Using the latest Versal VHK158 FPGA, FlightLLM beats the NVIDIA A100 GPU with 1.2$\times$ higher throughput.
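
The abstract does not spell out how length-adaptive compilation works. As a rough illustration only, one plausible reading is that token lengths are grouped into buckets so that a single compiled instruction stream can serve every length in its bucket, instead of compiling one stream per length. The sketch below assumes exactly that; the function names `make_buckets` and `bucket_for`, the fixed step size, and the bucket layout are all hypothetical and are not taken from the paper.

```python
import bisect


def make_buckets(max_len: int, step: int) -> list[int]:
    """Return ascending bucket upper bounds (step, 2*step, ...) covering max_len.

    Hypothetical helper: a fixed step size is an assumption for illustration.
    """
    if max_len <= 0 or step <= 0:
        raise ValueError("max_len and step must be positive")
    bounds = list(range(step, max_len + 1, step))
    # Make sure the last bucket reaches max_len even if step does not divide it.
    if not bounds or bounds[-1] < max_len:
        bounds.append(max_len)
    return bounds


def bucket_for(length: int, bounds: list[int]) -> int:
    """Return the smallest bucket bound that is >= length.

    Every length mapped to the same bound would share one compiled program.
    """
    if length < 1 or length > bounds[-1]:
        raise ValueError("length outside supported range")
    return bounds[bisect.bisect_left(bounds, length)]


if __name__ == "__main__":
    bounds = make_buckets(max_len=2048, step=16)
    # Number of programs to compile under this assumed scheme, versus one per length.
    print(len(bounds), "buckets instead of", 2048, "per-length programs")
    print("length 100 uses the program compiled for length", bucket_for(100, bounds))
```

Under these assumptions the number of compiled programs grows with the number of buckets rather than with the maximum sequence length, at the cost of padding each request up to its bucket bound.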

