Vision transformer models, such as ViT, Swin Transformer, and Transformer-in-Transformer, have recently gained significant traction in computer vision tasks because they capture global relations between features, which leads to superior performance. However, they are compute-heavy and difficult to deploy on resource-constrained edge devices. Existing hardware accelerators, including those for the closely related BERT transformer models, do not target highly resource-constrained environments. In this paper, we address this gap and propose ViTA, a configurable hardware accelerator for inference of vision transformer models that targets resource-constrained edge computing devices and avoids repeated off-chip memory accesses. We employ a head-level pipeline and inter-layer MLP optimizations, and we support several commonly used vision transformer models with changes to the control logic alone. We achieve nearly 90% hardware utilization efficiency on most vision transformer models, report a power consumption of 0.88 W when synthesised at a clock frequency of 150 MHz, and obtain reasonable frame rates, all of which make ViTA suitable for edge applications.