Visual Grounding (VG) is a core task in vision and language that involves locating the region of an image described by a language expression. To reduce reliance on manually labeled data, unsupervised methods have been developed that locate regions using pseudo-labels. However, the performance of existing unsupervised methods depends heavily on the quality of the pseudo-labels, and these methods often suffer from limited diversity. To apply vision-and-language pre-trained models to the grounding problem while making reasonable use of pseudo-labels, we propose CLIP-VG, a novel method that performs self-paced curriculum adapting of CLIP with pseudo-language labels. We propose a simple yet efficient end-to-end network architecture to transfer CLIP to visual grounding. Building on this CLIP-based architecture, we further propose single-source and multi-source curriculum adapting algorithms, which progressively find more reliable pseudo-labels to learn an optimal model, thereby balancing the reliability and diversity of the pseudo-language labels. Our method outperforms the current state-of-the-art unsupervised method by a significant margin on the RefCOCO/+/g datasets in both single-source and multi-source scenarios, with improvements ranging from 6.78% to 10.67% and from 11.39% to 14.87%, respectively. Furthermore, our approach even outperforms existing weakly supervised methods. The code and models are available at https://github.com/linhuixiao/CLIP-VG.
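To make the idea of progressively selecting more reliable pseudo-labels concrete, the following is a minimal sketch and not the CLIP-VG implementation. It assumes each pseudo-labeled sample already has a scalar reliability score, and it tries a grid of thresholds, keeping the subset that scores best under a caller-supplied evaluation function. The names `select_reliable_subset` and `toy_score`, the threshold grid, and the scoring rule are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def select_reliable_subset(
    reliabilities: Sequence[float],
    thresholds: Sequence[float],
    evaluate_subset: Callable[[List[int]], float],
) -> Tuple[float, List[int]]:
    """Try each threshold and keep the subset with the best score.

    A low threshold keeps many samples (more diversity, more noise);
    a high threshold keeps few samples (more reliable, less diverse).
    """
    best_score = float("-inf")
    best_threshold = thresholds[0]
    best_indices: List[int] = []
    for h in thresholds:
        # Keep only the samples whose reliability reaches the threshold.
        indices = [i for i, r in enumerate(reliabilities) if r >= h]
        if not indices:
            continue
        # In a real pipeline this would train or validate a model on the
        # subset; here it is an arbitrary callable (an assumption).
        score = evaluate_subset(indices)
        if score > best_score:
            best_score, best_threshold, best_indices = score, h, indices
    return best_threshold, best_indices


# Toy data: hypothetical reliability scores for seven pseudo-labels.
reliabilities = [0.95, 0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
thresholds = [0.0, 0.3, 0.5, 0.7, 0.9]


def toy_score(indices: List[int]) -> float:
    # Stand-in for validation performance: samples above 0.5 help,
    # samples below 0.5 hurt. This rule is purely illustrative.
    return sum(reliabilities[i] - 0.5 for i in indices)


best_threshold, best_indices = select_reliable_subset(
    reliabilities, thresholds, toy_score
)
print(best_threshold, best_indices)
```

In this toy setting the middle threshold wins: the lowest thresholds admit noisy samples that lower the score, while the highest thresholds discard useful ones, which mirrors the reliability versus diversity trade-off described in the abstract.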