Foundation vision models are critical for emulating the way human vision recognizes a diverse and open world. While recent self-supervised learning techniques show promise for this goal, we argue that signals from labelled data are also important for common-sense recognition, and that properly chosen pretext tasks can improve the efficiency of visual representation learning. To this end, we propose a novel pre-training framework that adopts both self-supervised and supervised visual pretext tasks in a multi-task manner. Specifically, given an image, we take a heuristic approach to basic visual understanding by considering its intrinsic style properties, the objects it contains together with their locations and correlations, and what it looks like in 3D space. However, large-scale annotations of object bounding boxes and correlations are usually hard to obtain. As an alternative, we develop a hybrid method that leverages both multi-label classification and self-supervised learning. On the one hand, under multi-label supervision, the pre-trained model can explore the detailed information in an image, e.g., image types, objects, and some of the semantic relations. On the other hand, self-supervised learning tasks, namely Masked Image Modeling (MIM) and contrastive learning, help the model learn pixel details and patch correlations. Experiments show that our pre-trained models deliver results on par with or better than the state of the art (SOTA) on multiple visual tasks. For example, with a vanilla Swin-B backbone, we achieve 85.3\% top-1 accuracy on ImageNet-1K classification, 47.9 box AP on COCO object detection with Mask R-CNN, and 50.6 mIoU on ADE-20K semantic segmentation with UperNet. These results show the ability of our vision foundation model to serve general-purpose vision tasks.
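The hybrid objective described in the abstract, multi-label supervision combined with MIM and contrastive learning in a multi-task manner, can be illustrated with a minimal sketch. The abstract does not specify the loss forms, the weighting scheme, or any hyperparameters, so everything below is an assumption made for illustration: binary cross-entropy for the multi-label task, an L1 reconstruction loss over masked patches for MIM, InfoNCE for the contrastive task, and a plain weighted sum. The function names and the default temperature are hypothetical and are not taken from the paper.

```python
import math


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent labels (assumed multi-label loss)."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def masked_l1(pred, target, mask):
    """Mean absolute error over masked patches only (assumed MIM loss)."""
    errors = [abs(p - t) for p, t, m in zip(pred, target, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


def info_nce(similarities, positive_index, temperature=0.2):
    """InfoNCE for one anchor: cross-entropy over temperature-scaled similarities."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the max before exponentiating for stability
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_sum - scaled[positive_index]


def hybrid_loss(cls_loss, mim_loss, con_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses (the weights are hypothetical)."""
    w_cls, w_mim, w_con = weights
    return w_cls * cls_loss + w_mim * mim_loss + w_con * con_loss


if __name__ == "__main__":
    cls = multilabel_bce([2.0, -1.0, 0.5], [1, 0, 1])
    mim = masked_l1([0.1, 0.4, 0.9], [0.0, 0.5, 1.0], [1, 0, 1])
    con = info_nce([0.8, 0.1, -0.2], positive_index=0)
    print(hybrid_loss(cls, mim, con))
```

In a real training loop these scalar losses would be computed on batched tensors from a shared backbone such as Swin-B, and the relative weights would need tuning; the sketch only shows how the three signals could be combined into one objective.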