Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techniques such as sparsification and quantization are commonly used to bridge the gap between LLMs' computation/memory overheads and hardware capacity. However, existing GPUs and transformer-based accelerators cannot efficiently process compressed LLMs because of three unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overheads. This paper proposes FlightLLM, which enables efficient LLM inference with a complete mapping flow on FPGAs. In FlightLLM, we highlight an innovative solution: the computation and memory overheads of LLMs can be addressed by utilizing FPGA-specific resources (e.g., DSP48 and the heterogeneous memory hierarchy). First, we propose a configurable sparse DSP chain to support different sparsity patterns with high computation efficiency. Second, we propose an always-on-chip decode scheme to boost memory bandwidth with mixed-precision support. Finally, to make FlightLLM available for real-world LLMs, we propose a length-adaptive compilation method to reduce the compilation overhead. Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0$\times$ higher energy efficiency and 1.8$\times$ better cost efficiency against commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using vLLM and SmoothQuant at a batch size of one. Using the latest Versal VHK158 FPGA, FlightLLM achieves 1.2$\times$ higher throughput than the NVIDIA A100 GPU.