Computer vision researchers are embracing two promising paradigms: Vision Transformers (ViTs) and Multi-task Learning (MTL). Both deliver strong performance but are computationally intensive: self-attention in ViTs has quadratic complexity, and an entire large MTL model must be activated even for a single task. M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE), in which only a small portion of subnetworks ("experts") is sparsely and dynamically activated based on the current task. M$^3$ViT achieves better accuracy and over 80% computation reduction, but leaves open challenges for efficient deployment on FPGA. Our work, dubbed Edge-MoE, addresses these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations, including (1) a novel reordering mechanism for self-attention, which requires only constant bandwidth regardless of the target parallelism; (2) a fast single-pass softmax approximation; (3) an accurate and low-cost GELU approximation; (4) a unified and flexible computing unit that is shared by almost all computational layers to maximally reduce resource usage; and (5) uniquely for M$^3$ViT, a novel patch reordering method to eliminate memory access overhead. Edge-MoE achieves 2.24x and 4.90x better energy efficiency compared with GPU and CPU, respectively. A real-time video demonstration is available online, along with our open-source code written using High-Level Synthesis.
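To make the sparse, task-dependent expert activation described above more concrete, the following is a minimal sketch of top-k MoE routing with one gate per task. It is an illustration only and is not taken from M$^3$ViT or Edge-MoE: the function names (`route_top_k`, `moe_layer`), the task names, the toy scaling experts, and the hand-picked gate weights are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(token, gate_weights, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # One gate score per expert: dot product of the token with that expert's gate row.
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Renormalise over the selected experts only.
    weights = softmax([scores[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token, task, gates, experts, k=2):
    """Run only the k experts chosen by the gate of `task`; skip all others."""
    selected = route_top_k(token, gates[task], k)
    out = [0.0] * len(token)
    for idx, weight in selected:
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in selected]


def make_scaling_expert(scale):
    """Hypothetical stand-in for an expert MLP: scales the token."""
    return lambda token: [scale * x for x in token]


# Four toy experts and two hypothetical tasks, each with its own gate matrix.
experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gates = {
    "segmentation": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    "depth": [[-1.0, 0.0], [0.0, -1.0], [1.0, 0.0], [0.0, 1.0]],
}
token = [2.0, 1.0]

seg_out, seg_used = moe_layer(token, "segmentation", gates, experts)
depth_out, depth_used = moe_layer(token, "depth", gates, experts)
print("segmentation uses experts", seg_used)
print("depth uses experts", depth_used)
```

In this toy setup the same token is routed to different experts depending on the task, and only two of the four experts are evaluated per call. That is the source of the compute saving, and it also hints at why expert weights may need to be fetched in a data-dependent order on hardware.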