The hybrid deep models of Vision Transformer (ViT) and Convolutional Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks. Scaling up the input resolution of such hybrid backbones naturally strengthens model capacity, but inevitably incurs a heavy computational cost that scales quadratically. Instead, we present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT), which upgrades the prevalent four-stage ViT to a five-stage ViT tailored for high-resolution inputs. HIRI-ViT is built upon the seminal idea of decomposing the typical CNN operations into two parallel CNN branches in a cost-efficient manner. One high-resolution branch directly takes the primary high-resolution features as inputs, but uses fewer convolution operations. The other low-resolution branch first performs down-sampling and then applies more convolution operations over the resulting low-resolution features. Experiments on both the recognition task (ImageNet-1K dataset) and dense prediction tasks (COCO and ADE20K datasets) demonstrate the superiority of HIRI-ViT. More remarkably, under comparable computational cost ($\sim$5.0 GFLOPs), HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448$\times$448 inputs, an absolute improvement of 0.9% over the 83.4% of iFormer-S with 224$\times$224 inputs.
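The quadratic cost of raising input resolution, and the saving from the two-branch decomposition, can be illustrated with a back-of-the-envelope FLOPs calculation. The sketch below is purely illustrative: the channel count (64) and layer counts (one lightweight conv on the high-resolution branch, three convs on the low-resolution branch) are assumed numbers, not the actual HIRI-ViT configuration.

```python
# Approximate multiply-accumulate count of a stride-1, 'same'-padded convolution:
# FLOPs ~ H * W * C_in * C_out * k * k
def conv_flops(h, w, c_in, c_out, k):
    return h * w * c_in * c_out * k * k

# Quadratic scaling with resolution: doubling H and W quadruples the cost.
base = conv_flops(224, 224, 64, 64, 3)
hires = conv_flops(448, 448, 64, 64, 3)
print(hires / base)  # 4.0

# Two parallel branches (hypothetical layer/channel counts for illustration):
#   high-resolution branch: a single 3x3 conv on the 448x448 features
#   low-resolution branch: downsample to 224x224, then three 3x3 convs
high_branch = conv_flops(448, 448, 64, 64, 3)
low_branch = 3 * conv_flops(224, 224, 64, 64, 3)
two_branch_total = high_branch + low_branch

# A single branch running all three convs at the full 448x448 resolution:
single_branch_total = 3 * conv_flops(448, 448, 64, 64, 3)
print(two_branch_total < single_branch_total)  # True
```

Under these assumed counts, the two-branch design costs 7x the 224-resolution baseline versus 12x for keeping every convolution at full resolution, which is the cost-efficiency argument the abstract makes.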