A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs

Transformer neural networks (TNNs) excel in natural language processing (NLP), machine translation, and computer vision (CV) without relying on recurrent or convolutional layers. However, they have high computational and memory demands, which are particularly hard to meet on resource-constrained devices like FPGAs. Moreover, transformer models vary in processing time across applications, requiring custom models with specific parameters. Designing a custom accelerator for each model is complex and time-intensive. Some existing custom accelerators have no runtime adaptability and often rely on sparse matrices to reduce latency, but the need for application-specific sparsity patterns makes their hardware design more challenging. This paper introduces ADAPTOR, a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. ADAPTOR improves the utilization of processing elements and on-chip memory, which increases parallelism and reduces latency. It incorporates efficient matrix tiling to distribute resources across FPGA platforms and is fully quantized for computational efficiency and portability. Evaluations on Xilinx Alveo U55C data center cards and embedded platforms like VC707 and ZCU102 show that our design is 1.2$\times$ and 2.87$\times$ more power efficient than the NVIDIA K80 GPU and the i7-8700K CPU, respectively. Additionally, it achieves a speedup of 1.7 to 2.25$\times$ compared to some state-of-the-art FPGA-based accelerators.
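
The abstract names two techniques, matrix tiling and full quantization, without showing how they combine. The following is a minimal software sketch of the general idea, not ADAPTOR's actual hardware design: the function names, the tile size, and the symmetric 8-bit quantization scheme are all assumptions chosen for illustration.

```python
# Illustrative sketch only. Function names, tile size, and the symmetric
# int8 scheme are assumptions; the abstract does not specify any of them.


def quantize(matrix, num_bits=8):
    """Symmetric per-matrix quantization to signed integers.

    Returns (integer matrix, scale) such that value ~= integer * scale.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [[max(-qmax, min(qmax, round(v / scale))) for v in row]
         for row in matrix]
    return q, scale


def tiled_matmul(a, b, tile=2):
    """Dense integer matrix multiply computed one tile at a time.

    Each (i0, j0, k0) step reads at most a tile x tile block of `a` and of
    `b`, which mimics a working set sized to fit a fixed on-chip buffer.
    """
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Multiply one block pair and accumulate partial sums.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = 0
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] += acc
    return c


def quantized_matmul(a, b, tile=2):
    """Quantize both operands, multiply in integers, then rescale."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    qc = tiled_matmul(qa, qb, tile)
    return [[v * scale_a * scale_b for v in row] for row in qc]
```

In this sketch the tile size is an ordinary function argument, so the same code handles different matrix shapes without being rewritten. That is only a loose software analogy for runtime adaptability; how ADAPTOR exposes its parameters at runtime is not described in the abstract.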

