Text-video retrieval aims to find the videos that are most semantically similar to a given text query. However, since videos contain more diverse content than texts, the main semantics expressed by the text and the video in a pair are often only partially relevant to each other. Mainstream methods use language-video attention modules to align texts and videos. Though effective, this paradigm inevitably introduces prohibitive computational overhead, resulting in inefficient retrieval. In this paper, we propose a simple yet effective method called Global-Local Contrastive Consistency Learning (GLCCL) to align the semantics of texts and videos. Specifically, we design a parameter-free Global-Local Interaction Module (GLIM) to generate semantic-related frame and video features in a text-guided manner. Furthermore, a Contrastive Score Consistency (CSC) loss is developed to promote consistency learning among different scores on positive pairs and suppress consistency learning on negative pairs. Empirical evidence suggests that the CSC loss provides the model with robust discriminative power between positives and hard negatives. Extensive experiments on three benchmark datasets (MSR-VTT, DiDeMo, and VATEX) demonstrate the effectiveness and superiority of our approach.
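To make the idea of parameter-free, text-guided feature generation concrete, the following is a minimal sketch of one common way to weight frame features by their similarity to a text query. It is an illustration only: the abstract does not specify the internals of GLIM, so the function name `text_guided_pooling`, the softmax weighting, and the `temperature` value are all assumptions and should not be read as the method proposed here.

```python
import numpy as np

def text_guided_pooling(text_emb, frame_embs, temperature=0.1):
    """Hypothetical text-guided frame aggregation with no learnable weights.

    text_emb:   (d,)   embedding of the query text
    frame_embs: (n, d) embeddings of n video frames
    `temperature` is a fixed hyperparameter (assumed value), not a parameter
    that is learned during training.
    """
    # L2-normalise so that dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)

    # Similarity of every frame to the text query, shape (n,).
    sims = f @ t

    # Numerically stable softmax over frames.
    logits = sims / temperature
    logits -= logits.max()
    weights = np.exp(logits)
    weights /= weights.sum()

    # Weighted sum of frames gives a text-conditioned video feature, shape (d,).
    video_emb = weights @ f
    return video_emb, weights


# Usage: three frames, the first of which points in the same direction as the text.
text = np.array([1.0, 0.0])
frames = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
video, w = text_guided_pooling(text, frames)
```

Because the weights come directly from text-frame similarities, this kind of aggregation adds no trainable module on top of the encoders, which is the efficiency argument made above against attention-based alignment.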