Image-text pretraining on web-scale image-caption datasets has become the default recipe for open-vocabulary classification and retrieval models, thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective used by these models focuses only on image-text alignment and does not incentivise image feature learning for dense prediction tasks. In this work, we introduce SILC, a novel framework for vision-language pretraining. SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks such as detection and segmentation, while also providing improvements on image-level tasks such as classification and retrieval. SILC models set a new state of the art for zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open-vocabulary segmentation. We further show that SILC features greatly benefit open-vocabulary detection, captioning, and visual question answering.
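The two training signals named in the abstract, image-text contrastive alignment and local-to-global self-distillation from an EMA teacher, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The symmetric InfoNCE form of the contrastive loss, the temperature values, the `distill_weight` parameter, and all function names are assumptions made for illustration. Details a real self-distillation setup typically needs, such as projection heads or teacher-output centering, are omitted.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # avoid division by zero
    return [x / norm for x in v]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/text embeddings.

    Assumption: pair i of image_embs and text_embs is the positive pair;
    every other pairing in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    loss = 0.0
    for i in range(n):
        i2t = log_softmax([dot(imgs[i], t) for t in txts], temperature)
        t2i = log_softmax([dot(txts[i], im) for im in imgs], temperature)
        loss -= 0.5 * (i2t[i] + t2i[i])
    return loss / n


def distillation_loss(student_local_logits, teacher_global_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy from the teacher's global-view distribution to the
    student's local-crop distribution (temperatures are illustrative)."""
    teacher_logp = log_softmax(teacher_global_logits, teacher_temp)
    student_logp = log_softmax(student_local_logits, student_temp)
    return -sum(math.exp(t) * s for t, s in zip(teacher_logp, student_logp))


def ema_update(teacher_params, student_params, momentum=0.99):
    """Move each teacher parameter toward the student by (1 - momentum)."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def total_loss(image_embs, text_embs, student_local_logits,
               teacher_global_logits, distill_weight=1.0):
    """Hypothetical combination: contrastive term plus weighted distillation."""
    return (contrastive_loss(image_embs, text_embs)
            + distill_weight * distillation_loss(student_local_logits,
                                                 teacher_global_logits))
```

In this sketch the teacher receives no gradients: only the student would be optimised against `total_loss`, and `ema_update` would be applied to the teacher after each student step.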