CLIP models perform remarkably well on zero-shot classification and retrieval tasks. However, recent studies have shown that the representations learned by CLIP are not well suited to dense prediction tasks such as object detection, semantic segmentation, or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate CLIP's weak performance on downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvements on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good-quality captions can surpass recent supervised, self-supervised, and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU on semantic segmentation and 11.5% lower RMSE on depth estimation than recent state-of-the-art Masked Image Modeling (MIM) pretraining methods such as Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains performance similar to Swin-L pretrained on ImageNet-22k on the semantic segmentation task while being 6.1$\times$ smaller. Moreover, we show that improving caption quality results in $10\times$ data efficiency when finetuning for dense prediction tasks.