Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, the efficiency of LLMs suffers from both heavy computation and memory overheads. Compression techniques such as sparsification and quantization are commonly used to narrow the gap between LLMs' computation/memory demands and hardware capacity. However, existing GPUs and transformer accelerators cannot process compressed LLMs efficiently, owing to three unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overhead. This paper proposes FlightLLM, which enables efficient LLM inference with a complete mapping flow on FPGAs. FlightLLM's key insight is that the computation and memory overheads of LLMs can be addressed with FPGA-specific resources (e.g., DSP48 blocks and the heterogeneous memory hierarchy). First, we propose a configurable sparse DSP chain that supports different sparsity patterns with high computational efficiency. Second, we propose an always-on-chip decode scheme that boosts effective memory bandwidth with mixed-precision support. Finally, to make FlightLLM practical for real-world LLMs, we propose a length-adaptive compilation method that reduces compilation overhead. Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0$\times$ higher energy efficiency and 1.8$\times$ better cost efficiency than commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using vLLM and SmoothQuant at a batch size of one. On the latest Versal VHK158 FPGA, FlightLLM surpasses the NVIDIA A100 GPU with 1.2$\times$ higher throughput.
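To give a feel for why length-adaptive compilation can bound compilation overhead, the following is a minimal illustrative sketch, not FlightLLM's actual method: instead of compiling a kernel schedule for every possible sequence length, one can pre-compile only a small set of padded length buckets and map each runtime sequence to the smallest bucket that fits it. All function names here are hypothetical.

```python
# Hypothetical sketch: bucketed sequence lengths so a finite set of
# pre-compiled instruction sequences covers all runtime lengths.

def length_buckets(max_len: int, min_len: int = 16) -> list[int]:
    """Power-of-two padded lengths from min_len up to max_len."""
    buckets = []
    n = min_len
    while n < max_len:
        buckets.append(n)
        n *= 2
    buckets.append(max_len)
    return buckets

def pick_bucket(seq_len: int, buckets: list[int]) -> int:
    """Smallest pre-compiled bucket that fits the sequence."""
    for b in buckets:
        if seq_len <= b:
            return b
    raise ValueError(f"length {seq_len} exceeds max bucket {buckets[-1]}")

buckets = length_buckets(2048)    # [16, 32, 64, ..., 1024, 2048]
print(pick_bucket(300, buckets))  # 512: 8 compiled variants cover 1..2048
```

With power-of-two buckets, the number of compiled variants grows only logarithmically in the maximum sequence length, at the cost of some padded (wasted) computation per request.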