In this paper, we introduce RCA-NOC, an approach to novel object captioning that uses relative contrastive learning to align visual and semantic representations. Our approach maximizes the compatibility between image regions and object tags in a contrastive manner. To set up a proper contrastive learning objective, we augment the tags of each image by exploiting the relative nature of positive and negative pairs obtained from foundation models such as CLIP. We then use the rank of each augmented tag in the resulting list as a relative relevance label, and contrast each top-ranked tag with a set of lower-ranked tags. This objective encourages top-ranked tags to be more compatible with their image and text context than lower-ranked tags, which improves the discriminative ability of the learned multimodal representation. We evaluate our approach on two datasets and show that RCA-NOC outperforms state-of-the-art methods by a large margin, demonstrating its effectiveness in improving vision-language representations for novel object captioning.
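The rank-based objective described above can be illustrated with a minimal sketch. This is not the exact RCA-NOC loss: the function name `relative_contrastive_loss`, the parameters `top_k` and `temperature`, and the choice to treat every tag ranked below a top-ranked tag as its negative are all assumptions made for illustration. The scores are assumed to be compatibility values between each tag and its image and text context, already ordered by the tag's rank (for example, a rank derived from CLIP similarity).

```python
import math


def relative_contrastive_loss(scores, top_k=2, temperature=0.1):
    """Illustrative rank-based contrastive loss (hypothetical, not the paper's exact form).

    scores: compatibility scores ordered by tag rank, index 0 = top-ranked tag.
    top_k: number of top-ranked tags treated as positives (assumed hyperparameter).
    temperature: softmax temperature (assumed hyperparameter).
    """
    if not 0 < top_k < len(scores):
        raise ValueError("top_k must leave at least one lower-ranked tag")

    logits = [s / temperature for s in scores]
    total = 0.0
    for i in range(top_k):
        # Tag i is the positive; every tag ranked below it acts as a negative.
        candidates = logits[i:]
        # Numerically stable log-sum-exp over the positive and its negatives.
        m = max(candidates)
        log_denom = m + math.log(sum(math.exp(x - m) for x in candidates))
        # Negative log-probability of picking the positive among the candidates.
        total += log_denom - logits[i]
    return total / top_k


# Scores that agree with the ranking should give a lower loss than scores that contradict it.
aligned = relative_contrastive_loss([0.9, 0.7, 0.2, 0.1])
misaligned = relative_contrastive_loss([0.1, 0.2, 0.7, 0.9])
print(aligned < misaligned)
```

Minimizing a loss of this shape pushes the score of each top-ranked tag above the scores of the tags ranked beneath it, which is the behavior the abstract attributes to the learning objective.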