TurboViT: Generating Fast Vision Transformers via Generative Architecture Search

Vision transformers have shown unprecedented levels of performance on a range of visual perception tasks in recent years. However, the architectural and computational complexity of such networks has made them challenging to deploy in real-world applications with high-throughput, low-memory requirements. As a result, there has been significant recent research on the design of efficient vision transformer architectures. In this study, we explore the generation of fast vision transformer architecture designs via generative architecture search (GAS) to achieve a strong balance between accuracy and architectural and computational efficiency. Through this generative architecture search process, we create TurboViT, a highly efficient hierarchical vision transformer architecture design generated around mask unit attention and Q-pooling design patterns. Compared to 10 other state-of-the-art efficient vision transformer architecture designs within a similar range of accuracy on the ImageNet-1K dataset, the resulting TurboViT architecture design achieves significantly lower architectural complexity (>2.47$\times$ smaller than FasterViT-0 while achieving the same accuracy) and computational complexity (>3.4$\times$ fewer FLOPs and 0.9% higher accuracy than MobileViT2-2.0). Furthermore, TurboViT demonstrates strong inference latency and throughput in both low-latency and batch processing scenarios (>3.21$\times$ lower latency and >3.18$\times$ higher throughput than FasterViT-0 in the low-latency scenario). These promising results demonstrate the efficacy of leveraging generative architecture search to generate efficient transformer architecture designs for high-throughput scenarios.
