While transformer architectures have dominated computer vision in recent years, these models cannot easily be deployed on resource-constrained hardware for autonomous driving tasks that require real-time performance. Their computational complexity and memory requirements limit their use, especially for applications with high-resolution inputs. In our work, we redesign the state-of-the-art Vision Transformer PLG-ViT into a much more compact and efficient architecture that is suitable for such tasks. We identify computationally expensive blocks in the original PLG-ViT architecture and propose several redesigns aimed at reducing the number of parameters and floating-point operations. As a result, we reduce the size of PLG-ViT by a factor of 5, with a moderate drop in performance. We propose two variants, one optimized for the best trade-off between parameter count and runtime and the other for the best trade-off between parameter count and accuracy. With only 5 million parameters, we achieve 79.5$\%$ top-1 accuracy on the ImageNet-1K classification benchmark. Our networks also perform well on general vision benchmarks such as COCO instance segmentation. In addition, we conduct a series of experiments demonstrating the potential of our approach for tasks specifically tailored to the challenges of autonomous driving and transportation.