Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, the efficiency of LLMs suffers from both heavy computation and memory overheads. Compression techniques such as sparsification and quantization are commonly used to narrow the gap between LLMs' computation/memory demands and hardware capacity. However, existing GPUs and transformer accelerators cannot process compressed LLMs efficiently, owing to three unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overhead. This paper proposes FlightLLM, which enables efficient LLM inference with a complete mapping flow on FPGAs. FlightLLM's key insight is that the computation and memory overheads of LLMs can be addressed with FPGA-specific resources (e.g., DSP48 blocks and the heterogeneous memory hierarchy). First, we propose a configurable sparse DSP chain that supports different sparsity patterns with high computational efficiency. Second, we propose an always-on-chip decode scheme that boosts effective memory bandwidth with mixed-precision support. Finally, to make FlightLLM practical for real-world LLMs, we propose a length-adaptive compilation method that reduces compilation overhead. Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0$\times$ higher energy efficiency and 1.8$\times$ better cost efficiency than commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using vLLM and SmoothQuant at a batch size of one. On the latest Versal VHK158 FPGA, FlightLLM surpasses the NVIDIA A100 GPU with 1.2$\times$ higher throughput.
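To give a feel for why length-adaptive compilation can bound compilation overhead, the following is a minimal illustrative sketch, not FlightLLM's actual method: instead of compiling a kernel schedule for every possible sequence length, one can pre-compile only a small set of padded length buckets and map each runtime sequence to the smallest bucket that fits it. All function names here are hypothetical.

```python
# Hypothetical sketch: bucketed sequence lengths so a finite set of
# pre-compiled instruction sequences covers all runtime lengths.

def length_buckets(max_len: int, min_len: int = 16) -> list[int]:
    """Power-of-two padded lengths from min_len up to max_len."""
    buckets = []
    n = min_len
    while n < max_len:
        buckets.append(n)
        n *= 2
    buckets.append(max_len)
    return buckets

def pick_bucket(seq_len: int, buckets: list[int]) -> int:
    """Smallest pre-compiled bucket that fits the sequence."""
    for b in buckets:
        if seq_len <= b:
            return b
    raise ValueError(f"length {seq_len} exceeds max bucket {buckets[-1]}")

buckets = length_buckets(2048)    # [16, 32, 64, ..., 1024, 2048]
print(pick_bucket(300, buckets))  # 512: 8 compiled variants cover 1..2048
```

With power-of-two buckets, the number of compiled variants grows only logarithmically in the maximum sequence length, at the cost of some padded (wasted) computation per request.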