State-of-the-art deep learning models for computer vision tasks are based on the transformer architecture and are often deployed in real-time applications. In this scenario, the resources available for each inference can vary, so it is useful to dynamically adapt execution to trade accuracy for efficiency. To create dynamic models, we leverage the resilience of vision transformers to pruning and switch between differently scaled versions of a model. Surprisingly, we find that most FLOPs are generated by convolutions, not attention. These relative FLOP counts are not a good predictor of GPU performance, since GPUs have special optimizations for convolutions. Some models are resilient enough that their execution can be adapted without retraining, while all models achieve better accuracy when the alternative execution paths are retrained. These insights mean that we can leverage CNN accelerators and these alternative execution paths to enable efficient and dynamic vision transformer inference. Our analysis shows that this type of dynamic execution can save 28\% of energy with a 1.4\% accuracy drop for SegFormer (63 GFLOPs) with no additional training, and 53\% of energy with a 3.3\% accuracy drop for ResNet-50 (4 GFLOPs) by switching between pretrained Once-For-All models.
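To make the idea of switching between scaled execution paths concrete, the following is a minimal sketch (not the paper's implementation) of budget-driven dispatch between two pretrained variants. The budget threshold, the use of ResNet-18 as the "scaled-down" path, and the \texttt{budget\_gflops} runtime signal are illustrative assumptions; in the paper the switch is between scaled versions of the same network (e.g., Once-For-All subnetworks).

\begin{verbatim}
# Minimal sketch, assuming PyTorch/torchvision: pick an execution
# path per inference based on a (hypothetical) compute budget.
import torch
from torchvision.models import resnet50, resnet18

# Stand-ins for a "full" and a "scaled-down" execution path.
full_model = resnet50(weights="IMAGENET1K_V1").eval()
small_model = resnet18(weights="IMAGENET1K_V1").eval()

def dynamic_infer(images: torch.Tensor,
                  budget_gflops: float) -> torch.Tensor:
    """Route the batch through the path that fits the budget.

    budget_gflops is a hypothetical runtime signal (e.g., derived
    from a power cap or deadline); ~4.1 GFLOPs is roughly the cost
    of ResNet-50 per 224x224 image, so tighter budgets fall back
    to the cheaper path.
    """
    model = full_model if budget_gflops >= 4.1 else small_model
    with torch.no_grad():
        return model(images)

# Example: a tight budget routes inference through the small variant.
batch = torch.randn(8, 3, 224, 224)
logits = dynamic_infer(batch, budget_gflops=2.0)
print(logits.shape)  # torch.Size([8, 1000])
\end{verbatim}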