Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning

Foundation vision models are critical for mimicking how human vision recognizes a diverse and open world. While recent self-supervised learning techniques show promise toward this goal, we argue that signals from labelled data are also important for common-sense recognition, and that properly chosen pretext tasks can improve the efficiency of visual representation learning. To this end, we propose a novel pre-training framework that adopts both self-supervised and supervised visual pretext tasks in a multi-task manner. Specifically, given an image, we take a heuristic approach to basic visual understanding by considering its intrinsic style properties, the objects inside it together with their locations and correlations, and how it looks in 3D space. However, large-scale annotations of object bounding boxes and correlations are usually hard to obtain. As an alternative, we develop a hybrid method that leverages both multi-label classification and self-supervised learning. On the one hand, under multi-label supervision, the pre-trained model can explore detailed information in an image, e.g., image types, objects, and some semantic relations. On the other hand, self-supervised learning tasks, namely Masked Image Modeling (MIM) and contrastive learning, help the model learn pixel details and patch correlations. Results show that our pre-trained models perform on par with or better than state-of-the-art (SOTA) methods on multiple visual tasks. For example, with a vanilla Swin-B backbone, we achieve 85.3% top-1 accuracy on ImageNet-1K classification, 47.9 box AP on COCO object detection with Mask R-CNN, and 50.6 mIoU on ADE-20K semantic segmentation with UperNet. This performance shows the ability of our vision foundation model to serve general-purpose vision tasks.

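As a rough illustration of the hybrid objective described in the abstract, the sketch below combines a multi-label classification loss, a masked-reconstruction loss, and a contrastive loss into one weighted sum. The abstract does not give the loss formulas, so everything here is an assumption: the choice of binary cross-entropy, L1 reconstruction on masked patches, and InfoNCE, as well as the function names, the temperature, and the task weights. Patches are reduced to single scalars to keep the example dependency-free.

```python
import math

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent labels (assumed multi-label loss)."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def masked_l1(pred, target, mask):
    """Mean absolute error over masked positions only (assumed MIM-style loss)."""
    errs = [abs(p - t) for p, t, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0

def info_nce(queries, keys, temperature=0.2):
    """InfoNCE: queries[i] should match keys[i] against all other keys in the batch."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    loss = 0.0
    for i, qi in enumerate(q):
        sims = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(sims)  # subtract the max before exponentiating for stability
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        loss += log_denom - sims[i]
    return loss / len(q)

def multitask_loss(cls_loss, mim_loss, con_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses; the weights are hypothetical."""
    w_cls, w_mim, w_con = weights
    return w_cls * cls_loss + w_mim * mim_loss + w_con * con_loss

# Toy values standing in for model outputs on one batch.
cls = multilabel_bce([2.0, -1.0, 0.5], [1.0, 0.0, 1.0])
mim = masked_l1([0.2, 0.9, 0.4], [0.0, 1.0, 0.5], [1, 0, 1])
con = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]])
total = multitask_loss(cls, mim, con)
```

In a real training loop each term would come from a separate head on a shared backbone, and the relative weights would need tuning; the equal weights above are only a placeholder.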