The rapid advancement of artificial intelligence (AI), particularly large language models (LLMs), has profoundly changed the way we work and communicate. However, the colossal scale of LLMs poses significant operational challenges, especially when deploying them on resource-constrained edge devices such as smartphones, robots, and embedded systems. In this work, we propose EdgeLLM, an efficient CPU-FPGA heterogeneous acceleration framework that markedly enhances the computational efficiency of LLMs at the edge. We first analyze all operators within AI models and develop a universal data-parallelism scheme that is generic and can be adapted to any type of AI algorithm. We then develop fully customized hardware operators for the designated data formats. A multitude of optimization techniques are integrated into the design, including approximate FP16*INT4 and FP16*FP16 computation engines, group vector systolic arrays, log-scale structured sparsity, and asynchronous overlap of data transfer and processing. Finally, we propose an end-to-end compilation scheme that dynamically compiles all operators and maps the whole model onto the CPU-FPGA heterogeneous system. Deployed on an AMD Xilinx VCU128 FPGA, our accelerator achieves 1.67x higher throughput and 7.4x higher energy efficiency than a commercial GPU (NVIDIA A100-SXM4-80G) on ChatGLM2-6B, and outperforms the state-of-the-art FPGA accelerator FlightLLM by 10%~20% in HBM bandwidth utilization and LLM throughput.
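To make the FP16*INT4 engine concrete: the idea behind such mixed-precision engines is to store weights as 4-bit integers with a per-group FP16 scale, then dequantize on the fly before multiplying with FP16 activations. The sketch below is a conceptual software illustration only, not the paper's hardware design; the group size of 128 and the symmetric quantization scheme are illustrative assumptions.

```python
import numpy as np

GROUP = 128  # illustrative group size (an assumption, not from the paper)

def quantize_int4(w: np.ndarray):
    """Group-wise symmetric INT4 quantization of an FP16 weight vector.

    Each group of GROUP weights shares one FP16 scale; quantized values
    fit in the signed 4-bit range [-8, 7].
    """
    w = w.astype(np.float16).reshape(-1, GROUP)
    scale = (np.abs(w).max(axis=1, keepdims=True) / 7).astype(np.float16)
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def matvec_fp16_int4(x: np.ndarray, q: np.ndarray, scale: np.ndarray):
    """FP16 activations times INT4 weights: dequantize per group, then dot."""
    w_dequant = (q.astype(np.float16) * scale).reshape(-1)
    return float(np.dot(x.astype(np.float16), w_dequant))

np.random.seed(0)
w = np.random.randn(1024).astype(np.float16)
x = np.random.randn(1024).astype(np.float16)
q, s = quantize_int4(w)
y_approx = matvec_fp16_int4(x, q, s)
```

A hardware engine would skip the explicit dequantization array and instead fold the group scale into the accumulator, but the arithmetic it approximates is the same.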