Image-text pretraining on web-scale image-caption datasets has become the default recipe for open-vocabulary classification and retrieval models, thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective focuses only on image-text alignment and does not incentivise learning image features suited to dense prediction. In this work, we propose SILC, which adds local-to-global correspondence learning by self-distillation as an additional objective for contrastive pretraining. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on several computer vision tasks, including classification, retrieval, and especially segmentation. We further show that SILC scales better than the baselines for the same training duration. SILC sets a new state of the art for zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open-vocabulary segmentation.
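As a rough illustration of the two ingredients named above, the following minimal sketch combines an image-text contrastive term with a local-to-global self-distillation term and an EMA teacher update. The abstract does not specify the exact loss forms, so everything here is an assumption: a symmetric InfoNCE loss for the contrastive part, a temperature-sharpened cross-entropy between the teacher's output on a global view and the student's output on a local crop for the distillation part, and all function names, temperatures, the momentum, and `distill_weight` are hypothetical placeholders rather than the settings used for SILC.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def softmax(logits, temperature=1.0):
    return [math.exp(v) for v in log_softmax(logits, temperature)]


def ema_update(teacher, student, momentum=0.99):
    """EMA teacher: teacher <- momentum * teacher + (1 - momentum) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def distillation_loss(teacher_global_logits, student_local_logits,
                      teacher_temp=0.04, student_temp=0.1):
    """Cross-entropy from the teacher's distribution on a global view
    to the student's distribution on a local crop (assumed loss form)."""
    p_teacher = softmax(teacher_global_logits, teacher_temp)
    log_p_student = log_softmax(student_local_logits, student_temp)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))


def contrastive_loss(similarity, temperature=0.07):
    """Symmetric InfoNCE over a square image-text similarity matrix;
    matched pairs are assumed to lie on the diagonal."""
    n = len(similarity)
    loss = 0.0
    for i in range(n):
        image_to_text = log_softmax(similarity[i], temperature)
        text_to_image = log_softmax([similarity[j][i] for j in range(n)], temperature)
        loss -= 0.5 * (image_to_text[i] + text_to_image[i])
    return loss / n


def total_loss(similarity, teacher_global_logits, student_local_logits,
               distill_weight=1.0):
    """Contrastive objective plus the additional self-distillation objective."""
    return (contrastive_loss(similarity)
            + distill_weight * distillation_loss(teacher_global_logits,
                                                 student_local_logits))
```

In a real training loop the student would receive gradients from `total_loss`, while the teacher would receive none and would be refreshed with `ema_update` after each optimizer step.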