Vision Transformers (ViTs) represent a groundbreaking shift in machine-learning approaches to computer vision. Unlike traditional approaches, ViTs apply the self-attention mechanism, widely used in natural language processing, to image patches. Despite their advantages in modeling visual tasks, deploying ViTs on hardware platforms, notably Field-Programmable Gate Arrays (FPGAs), poses considerable challenges. These challenges stem primarily from ViTs' non-linear calculations and their high computational and memory demands. This paper introduces CHOSEN, a software-hardware co-design framework that addresses these challenges and provides an automated flow for deploying ViTs on FPGAs to maximize performance. Our framework is built on three fundamental contributions: a multi-kernel design that maximizes memory bandwidth, chiefly by exploiting multiple DDR memory banks; approximations of the non-linear functions that incur minimal accuracy degradation while making efficient use of the FPGA's available logic blocks; and an efficient compiler that maximizes the performance and memory efficiency of the compute kernels via a novel design-space-exploration algorithm, which finds the hardware configuration achieving optimal throughput and latency. Compared with state-of-the-art ViT accelerators, CHOSEN achieves 1.5x and 1.42x higher throughput on the DeiT-S and DeiT-B models, respectively.