Open-vocabulary dense prediction tasks, including object detection and image segmentation, have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when the vision-language alignment of CLIP is transferred from global image representations to local region representations for open-vocabulary dense prediction, CLIP ViTs suffer from a domain shift from full images to local image regions. In this paper, we conduct an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. We then propose an approach named CLIPSelf, which adapts the image-level recognition ability of a CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf enables a ViT to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.
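The self-distillation objective described above can be illustrated with a toy sketch. This is not the released implementation: the average pooling over a box, the cosine-distance loss, and all names here (`pool_region`, `cosine_similarity`, `self_distill_loss`) are assumptions chosen for illustration, since the abstract does not specify how regions are pooled or which loss is used. The sketch only shows the idea of matching a region vector taken from the dense feature map to the image-level embedding of the corresponding crop.

```python
import math


def pool_region(feature_map, box):
    """Average the feature vectors inside box = (row0, col0, row1, col1), end-exclusive.

    Assumption: plain average pooling stands in for whatever region
    extraction the actual method uses. The box must be non-empty.
    """
    r0, c0, r1, c1 = box
    cells = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    dim = len(cells[0])
    return [sum(v[d] for v in cells) / len(cells) for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def self_distill_loss(feature_map, boxes, crop_embeddings):
    """Mean (1 - cosine similarity) between region vectors and crop embeddings.

    Assumption: a cosine-distance loss; the crop embeddings play the role of
    the image-level (teacher) representations of the cropped regions.
    """
    losses = [
        1.0 - cosine_similarity(pool_region(feature_map, box), teacher)
        for box, teacher in zip(boxes, crop_embeddings)
    ]
    return sum(losses) / len(losses)


# Toy 2x2 dense feature map with 2-dimensional features (made-up numbers).
feature_map = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
boxes = [(0, 0, 1, 2), (1, 0, 2, 2)]        # top row, bottom row
crop_embeddings = [[2.0, 0.0], [0.0, 3.0]]  # stand-ins for image-level crop embeddings
loss = self_distill_loss(feature_map, boxes, crop_embeddings)
```

In this toy setup each pooled region vector points in the same direction as its crop embedding, so the loss is zero; a region whose vector is orthogonal to its crop embedding would contribute a loss of one. No region-text pairs appear anywhere in the objective, which matches the claim in the abstract.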