Vision-language pre-training has demonstrated remarkable zero-shot recognition ability and the potential to learn generalizable visual representations from language supervision. Going a step further, language-supervised semantic segmentation enables spatial localization of textual inputs by learning pixel grouping solely from image-text pairs. Nevertheless, state-of-the-art methods suffer from a clear semantic gap between the visual and textual modalities: many visual concepts that appear in images are missing from their paired captions. This semantic misalignment propagates through pre-training and leads to inferior zero-shot performance in dense prediction, because the textual representations capture too few visual concepts. To close this gap, we propose Concept Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing semantics. For each image-text pair, we establish a concept archive that maintains potential visually matched concepts, built with our proposed vision-driven expansion and text-to-vision-guided ranking. Relevant concepts can then be identified via cluster-guided sampling and fed into pre-training, thereby bridging the gap between visual and textual semantics. Extensive experiments on a broad suite of 8 segmentation benchmarks show that CoCu achieves superb zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin, suggesting the value of bridging the semantic gap in pre-training data.
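The three stages named above (expansion, ranking, sampling) can be illustrated with a toy sketch. This is not the authors' implementation: random unit vectors stand in for CLIP image and text embeddings, the ranking step is simplified to plain image-to-concept cosine similarity, and score-weighted sampling is swapped in for the paper's cluster-guided sampling. All function names, the image bank, and the vocabulary are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def expand_concepts(query_img, bank_imgs, bank_concepts, k=3):
    """Toy vision-driven expansion: pool the concepts attached to the
    k bank images most similar to the query image (assumed procedure)."""
    sims = bank_imgs @ query_img
    nearest = np.argsort(-sims)[:k]
    archive = []
    for i in nearest:
        for concept in bank_concepts[i]:
            if concept not in archive:  # keep the archive duplicate-free
                archive.append(concept)
    return archive

def rank_concepts(query_img, archive, concept_emb):
    """Toy ranking: order archive concepts by cosine similarity between
    the concept embedding and the query image (a simplification)."""
    scores = np.array([concept_emb[c] @ query_img for c in archive])
    order = np.argsort(-scores)
    return [archive[i] for i in order], scores[order]

def sample_concepts(ranked, scores, n, rng, temperature=0.1):
    """Score-weighted sampling without replacement; a stand-in for
    cluster-guided sampling, which is not reproduced here."""
    probs = np.exp((scores - scores.max()) / temperature)
    probs /= probs.sum()
    n = min(n, len(ranked))
    idx = rng.choice(len(ranked), size=n, replace=False, p=probs)
    return [ranked[i] for i in idx]

# Hypothetical data: random vectors in place of real CLIP features.
rng = np.random.default_rng(0)
dim = 16
vocab = ["dog", "grass", "frisbee", "tree", "sky", "person", "bench"]
concept_emb = {c: l2_normalize(rng.normal(size=dim)) for c in vocab}
bank_imgs = l2_normalize(rng.normal(size=(5, dim)))
bank_concepts = [
    ["dog", "grass"],
    ["dog", "frisbee", "person"],
    ["tree", "sky"],
    ["person", "bench"],
    ["grass", "sky", "tree"],
]
query_img = l2_normalize(rng.normal(size=dim))

archive = expand_concepts(query_img, bank_imgs, bank_concepts, k=3)
ranked, scores = rank_concepts(query_img, archive, concept_emb)
picked = sample_concepts(ranked, scores, n=3, rng=rng)
print("archive:", archive)
print("ranked:", ranked)
print("picked:", picked)
```

In this sketch the sampled concepts would be appended to the original caption before pre-training; how CoCu actually combines them with the caption is not specified in the abstract.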