Considering the high-performance and low-power requirements of edge AI, this study designs a specialized instruction set processor for edge AI based on the RISC-V instruction set architecture, addressing practical issues in digital signal processing on edge devices. The design improves the execution efficiency of edge AI and reduces its energy consumption with limited hardware overhead, meeting the demand for efficient large language model (LLM) inference in edge AI applications. The main contributions of this paper are as follows. First, to match the computational characteristics of large language models, custom instructions for vector dot product calculation were extended on the RISC-V instruction set, accelerating LLM computation on dedicated vector dot product hardware. Second, based on the open-source high-performance RISC-V processor core XiangShan (Nanhu architecture), the vector dot product specialized instruction set processor Nanhu-vdot was implemented, adding vector dot product computation units and the corresponding pipeline processing logic to the XiangShan Nanhu core. Nanhu-vdot was verified on an FPGA, achieving more than four times the speed of scalar code in vector dot product computation. Finally, using a hardware-software co-design approach for inference of the second-generation Generative Pre-trained Transformer (GPT-2) model, execution speed improved by approximately 30% over a pure-software implementation, with almost no additional hardware resource or power consumption.
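The kernel targeted by the custom instructions can be sketched as follows. This is an illustrative model only, not the paper's code: the function names and the 4-element lane width are assumptions, chosen to mirror the reported greater-than-4x speedup over scalar execution.

```python
def dot_scalar(a, b):
    """Baseline: one multiply-accumulate per loop step, as scalar RISC-V code would do."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_vdot4(a, b, lanes=4):
    """Model of a multi-lane dot-product unit (lane count assumed to be 4):
    each 'custom-instruction issue' consumes `lanes` element pairs at once,
    so the main loop runs lanes-times fewer iterations."""
    acc = 0
    n = len(a)
    body = n - n % lanes
    for i in range(0, body, lanes):
        # One custom-instruction issue: `lanes` multiplies plus a reduction
        acc += sum(a[i + k] * b[i + k] for k in range(lanes))
    for i in range(body, n):
        # Scalar tail for leftover elements when n is not a multiple of `lanes`
        acc += a[i] * b[i]
    return acc
```

Both functions compute the same dot product; the difference is only in how many loop iterations (i.e., issued instructions) are needed, which is where the hardware unit gains its speed.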