Visual Grounding (VG) refers to locating the region in a given image that is described by a language expression, and it is a critical topic in the vision-language field. To alleviate the dependence on labeled data, existing unsupervised methods try to locate regions using task-unrelated pseudo-labels. However, a large proportion of these pseudo-labels are noisy and lack diversity in language taxonomy. Inspired by advances in vision-language pretraining (VLP), we consider utilizing VLP models to realize unsupervised transfer learning for the downstream grounding task. We therefore propose CLIP-VG, a novel method that conducts self-paced curriculum adapting of CLIP by exploiting pseudo-language labels to solve the VG problem. Building on an efficient model structure, we first propose single-source and multi-source curriculum adapting methods for unsupervised VG that progressively sample more reliable cross-modal pseudo-labels to obtain the optimal model, thus achieving implicit knowledge exploitation and denoising. Our method outperforms the existing state-of-the-art unsupervised VG method, Pseudo-Q, by a large margin in both the single-source and multi-source scenarios, i.e., by 6.78%~10.67% and 11.39%~24.87% respectively on the RefCOCO/+/g datasets, and it even outperforms existing weakly supervised methods. The code and models will be released at \url{https://github.com/linhuixiao/CLIP-VG}.
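The idea of progressively sampling more reliable pseudo-labels can be illustrated with a minimal sketch. It is not the CLIP-VG implementation: the reliability measure used here (IoU between a pseudo-label box and the box predicted by the current model), the helper names `iou`, `select_reliable`, and `curriculum_rounds`, and the threshold values are all assumptions made for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_reliable(pseudo: Sequence[Box], predicted: Sequence[Box],
                    threshold: float) -> List[int]:
    """Indices of pseudo-labels whose box agrees with the model's prediction.

    Hypothetical reliability criterion: IoU >= threshold.
    """
    return [i for i, (p, q) in enumerate(zip(pseudo, predicted))
            if iou(p, q) >= threshold]


def curriculum_rounds(pseudo: Sequence[Box], predicted: Sequence[Box],
                      thresholds: Sequence[float]) -> Dict[float, List[int]]:
    """Selected subset per threshold, ordered from strict to loose.

    In a real self-paced loop the model would be re-adapted on each subset
    and `predicted` recomputed before the next round; that step is omitted.
    """
    return {t: select_reliable(pseudo, predicted, t) for t in thresholds}


# Toy data: three pseudo-labels with decreasing agreement (IoU 1.0, 0.5, 0.0).
pseudo_boxes = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
predicted_boxes = [(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30)]
rounds = curriculum_rounds(pseudo_boxes, predicted_boxes, [0.9, 0.5, 0.0])
```

On the toy data, the strictest round keeps only the first pseudo-label and each looser round admits one more, so the training set grows from the most reliable samples outward.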