Many state-of-the-art deep learning models for computer vision tasks are based on the transformer architecture. Such models can be computationally expensive and are typically statically configured for a given deployment scenario. However, in real-time applications, the resources available for each inference can vary considerably and can fall below what state-of-the-art models require. Dynamic models can adapt model execution to meet real-time application resource constraints. While prior work on dynamic models primarily minimized resource utilization for less complex input images, we adapt vision transformers to meet dynamic system resource constraints, independent of the input image. We find that, unlike early transformer models, recent state-of-the-art vision transformers rely heavily on convolution layers. We show that pretrained models are fairly resilient to skipping computation in the convolution and self-attention layers, enabling us to create a low-overhead system for dynamic real-time inference without extra training. Finally, we explore compute organization and memory sizes to find settings that efficiently execute dynamic vision transformers. We find that wider vector sizes produce a better energy-accuracy tradeoff across dynamic configurations despite limiting the granularity of dynamic execution, but scaling accelerator resources for larger models does not significantly improve the latency-area-energy tradeoffs. Our accelerator saves 20% of execution time and 30% of energy with a 4% drop in accuracy for a pretrained SegFormer B2 model using our dynamic inference approach, and 57% of execution time for the ResNet-50 backbone with a 4.5% drop in accuracy using the Once-For-All approach.
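The interaction between vector width and the granularity of dynamic execution can be illustrated with a small sketch. This is not the paper's implementation: the function names `active_channels` and `pointwise_conv_macs`, the `keep_fraction` parameter, and the choice of a 1x1 (pointwise) convolution are all assumptions made for illustration. The sketch assumes that channels can only be skipped in whole vector-width groups, so a wider vector gives coarser steps in the amount of compute that can be dropped.

```python
def active_channels(total_channels: int, keep_fraction: float, vector_width: int) -> int:
    """Number of input channels to compute, rounded down to a whole number of
    vector-width groups (hypothetical policy; at least one group is kept)."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if total_channels <= 0 or vector_width <= 0:
        raise ValueError("total_channels and vector_width must be positive")
    wanted = int(total_channels * keep_fraction)
    kept = (wanted // vector_width) * vector_width
    # Keep at least one vector group, but never more than the layer has.
    return min(total_channels, max(vector_width, kept))


def pointwise_conv_macs(height: int, width: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulate count of a 1x1 convolution over a height x width map."""
    return height * width * c_in * c_out


if __name__ == "__main__":
    # Illustrative layer shape only; these numbers are not taken from the paper.
    h, w, c_in, c_out = 32, 32, 64, 64
    full = pointwise_conv_macs(h, w, c_in, c_out)
    for vw in (4, 16, 32):
        kept = active_channels(c_in, keep_fraction=0.7, vector_width=vw)
        macs = pointwise_conv_macs(h, w, kept, c_out)
        print(f"vector width {vw:2d}: {kept} of {c_in} channels, "
              f"{macs / full:.0%} of full MACs")
```

With a requested keep fraction of 0.7 on 64 channels, a vector width of 4 can keep 44 channels, while widths of 16 and 32 both fall back to 32 channels, which shows how wider vectors reduce the number of reachable operating points.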