Graphics Processing Units (GPUs) have become the leading hardware accelerator for deep learning and are widely used for both training and inference of transformers. Transformers have achieved state-of-the-art performance in many areas of machine learning and underlie most modern Large Language Models (LLMs). However, GPUs consume large amounts of energy, which raises environmental concerns, drives up operational costs, and makes them unsuitable for edge computing. We develop an accelerator for transformers, specifically Llama 2, an open-source state-of-the-art LLM, using high-level synthesis (HLS) on Field Programmable Gate Arrays (FPGAs). HLS allows us to rapidly prototype FPGA designs without writing code at the register-transfer level (RTL). We name our method HLSTransform. The FPGA designs we synthesize with HLS on a Xilinx Virtex UltraScale+ VU9P FPGA reduce energy per token by up to 12.75x compared to an Intel Xeon Broadwell E5-2686 v4 CPU and by up to 8.25x compared to an NVIDIA RTX 3090 GPU. They also increase inference speed by up to 2.46x over the CPU and maintain 0.53x the speed of the RTX 3090 despite the GPU's four times higher base clock rate. Because open-source FPGA accelerators for transformers are lacking, we open-source our code and document our synthesis steps. We hope this work will serve as a step toward democratizing the use of FPGAs in transformer inference and will inspire research into energy-efficient inference methods more broadly. The code is available at https://github.com/HLSTransform/submission.
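The energy and speed figures above can be related with a small back-of-the-envelope calculation. The sketch below assumes that energy per token equals average power divided by throughput in tokens per second, and that each pair of "up to" figures comes from the same configuration; neither assumption is stated in the abstract, so the result is only indicative. The function name `implied_power_ratio` is hypothetical and introduced here for illustration.

```python
def implied_power_ratio(energy_reduction: float, speedup: float) -> float:
    """Baseline-to-FPGA average power ratio implied by two reported ratios.

    Assumption (not from the abstract): energy per token is average power
    divided by tokens per second, so
        energy_reduction = power_ratio * speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return energy_reduction / speedup


# Ratios reported in the abstract (FPGA relative to each baseline).
cpu_power_ratio = implied_power_ratio(energy_reduction=12.75, speedup=2.46)
gpu_power_ratio = implied_power_ratio(energy_reduction=8.25, speedup=0.53)

print(f"CPU draws roughly {cpu_power_ratio:.2f}x the FPGA's average power")
print(f"GPU draws roughly {gpu_power_ratio:.2f}x the FPGA's average power")
```

Under these assumptions, the FPGA's advantage over the CPU comes from both lower power and higher throughput, while its advantage over the GPU comes entirely from lower power, since the FPGA is slower than the GPU in tokens per second.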