Vision-language foundation models, represented by Contrastive Language-Image Pre-training (CLIP), have gained increasing attention for their ability to jointly understand visual and textual content. However, existing approaches primarily train models to match global image representations with textual descriptions, thereby overlooking the critical alignment between local image regions and their corresponding text tokens. This paper extends CLIP with multi-granularity alignment. Specifically, we construct a new dataset of pseudo annotations at multiple levels of granularity, covering image-level, region-level, and pixel-level captions and tags. Building on this dataset, we develop a unified multi-granularity learning framework, named UMG-CLIP, that equips the model with versatile perception abilities across different levels of detail. Equipped with parameter-efficient tuning, UMG-CLIP surpasses widely used CLIP models and achieves state-of-the-art performance on diverse image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation. We hope UMG-CLIP can serve as a valuable option for advancing vision-language foundation models.
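To make the contrast between global and local alignment concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss applied at two granularities (image-caption pairs and region-text pairs). It is an illustration under stated assumptions, not UMG-CLIP's actual objective: the function names, the temperature of 0.07, and the `region_weight` parameter are hypothetical, embeddings are assumed to be precomputed, and the pixel-level term is omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(visual, text, temperature=0.07):
    """CLIP-style symmetric InfoNCE over N matched (visual, text) pairs.

    Row i of `visual` is assumed to match row i of `text`; every other
    row in the batch serves as a negative. The temperature is an
    illustrative assumption, not a value taken from the paper.
    """
    v = l2_normalize(visual)
    t = l2_normalize(text)
    logits = v @ t.T / temperature          # (N, N) similarity matrix
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # visual -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> visual
    return 0.5 * (loss_v2t + loss_t2v)


def multi_granularity_loss(image_emb, caption_emb,
                           region_emb, region_text_emb,
                           region_weight=1.0):
    """Sum of an image-level and a region-level contrastive term.

    `region_weight` is a hypothetical balancing factor introduced
    here for illustration only.
    """
    image_loss = symmetric_contrastive_loss(image_emb, caption_emb)
    region_loss = symmetric_contrastive_loss(region_emb, region_text_emb)
    return image_loss + region_weight * region_loss
```

In this sketch, a model trained only with the image-level term receives no signal about which region matches which phrase; adding the region-level term is one simple way to supply that signal.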