Deep supervision, which adds extra supervision to the intermediate features of a neural network, was widely used in image classification in the early deep learning era because it significantly reduces training difficulty and eases optimization, for example by avoiding the vanishing gradients seen in vanilla training. Nevertheless, with the emergence of normalization techniques and residual connections, deep supervision in image classification was gradually phased out. In this paper, we revisit deep supervision for masked image modeling (MIM), which pre-trains a Vision Transformer (ViT) via a mask-and-predict scheme. Experimentally, we find that deep supervision drives the shallower layers to learn more meaningful representations, accelerates model convergence, and increases attention diversity. Our approach, called DeepMIM, significantly boosts the representation capability of each layer. In addition, DeepMIM is compatible with many MIM models across a range of reconstruction targets. For instance, with ViT-B, DeepMIM applied to MAE achieves 84.2% top-1 accuracy on ImageNet, outperforming MAE by 0.6 points. By combining DeepMIM with a stronger tokenizer, CLIP, our model achieves state-of-the-art performance on various downstream tasks, including image classification (85.6% top-1 accuracy on ImageNet-1K, outperforming MAE-CLIP by 0.8 points), object detection (52.8 APbox on COCO), and semantic segmentation (53.1 mIoU on ADE20K). Code and models are available at https://github.com/OliverRensu/DeepMIM.
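The idea of supervising intermediate layers in a mask-and-predict setup can be illustrated with a small sketch. The abstract does not specify the loss, so the following assumes an MAE-style mean squared error computed only on masked patches, with one reconstruction per supervised layer and equal weights by default. The function names `masked_mse` and `deep_supervision_loss` and the toy data are hypothetical and are not taken from the DeepMIM code base.

```python
from typing import List, Optional, Sequence

Patch = Sequence[float]


def masked_mse(pred: Sequence[Patch], target: Sequence[Patch],
               mask: Sequence[int]) -> float:
    """Mean squared error averaged over masked patches only (mask[i] == 1)."""
    total, count = 0.0, 0
    for p_patch, t_patch, m in zip(pred, target, mask):
        if m:
            # Per-patch error is the mean over the patch's values.
            total += sum((p - t) ** 2 for p, t in zip(p_patch, t_patch)) / len(t_patch)
            count += 1
    return total / count if count else 0.0


def deep_supervision_loss(layer_preds: List[Sequence[Patch]],
                          target: Sequence[Patch],
                          mask: Sequence[int],
                          weights: Optional[List[float]] = None) -> float:
    """Weighted sum of masked reconstruction losses, one per supervised layer.

    Assumption: each supervised layer (intermediate or final) yields its own
    reconstruction of the same target; equal weights are used by default.
    """
    if weights is None:
        weights = [1.0] * len(layer_preds)
    if len(weights) != len(layer_preds):
        raise ValueError("need exactly one weight per supervised layer")
    return sum(w * masked_mse(pred, target, mask)
               for w, pred in zip(weights, layer_preds))


# Toy example: three patches of two values each; patches 0 and 2 are masked.
target = [[1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
mask = [1, 0, 1]
shallow_pred = [[0.0, 0.0], [5.0, 5.0], [2.0, 2.0]]  # from an intermediate layer
final_pred = [[1.0, 1.0], [9.0, 9.0], [1.0, 3.0]]    # from the last layer

print(deep_supervision_loss([shallow_pred, final_pred], target, mask))  # 1.0
```

Errors on the visible patch (index 1) are ignored, so each layer contributes 0.5 here. Passing `weights=[0.5, 1.0]` would down-weight the shallow layer, which is one possible design choice rather than something the abstract prescribes.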