Adapting large-scale image-text pre-training models, e.g., CLIP, to the video domain represents the current state-of-the-art for text-video retrieval. The primary approaches involve mapping text-video pairs into a common embedding space and leveraging cross-modal interactions on specific entities for semantic alignment. Though effective, these paradigms entail prohibitive computational costs, leading to inefficient retrieval. To address this, we propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval. Specifically, we introduce a parameter-free global interaction module to explore coarse-grained alignment. Then, we devise a shared local interaction module that employs several learnable queries to capture latent semantic concepts for learning fine-grained alignment. Furthermore, an Inter-Consistency Loss (ICL) is devised to align concepts between each visual query and its corresponding textual query, and an Intra-Diversity Loss (IDL) is developed to repulse the distributions within visual (or textual) queries, yielding more discriminative concepts. Extensive experiments on five widely used benchmarks (i.e., MSR-VTT, MSVD, DiDeMo, LSMDC, and ActivityNet) substantiate the superior effectiveness and efficiency of the proposed method. Remarkably, our method achieves performance comparable to the SOTA while being nearly 220 times faster in terms of computational cost. Code is available at: https://github.com/zchoi/GLSCL.
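The two losses described above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the function names, the cosine-similarity formulation, and the hinge at zero in the diversity term are all assumptions made for illustration; the ICL here pulls each visual query toward its paired textual query, while the IDL penalizes similarity between different queries of the same modality.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Normalize each query vector to unit length for cosine similarity.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def inter_consistency_loss(visual_queries, textual_queries):
    # Hypothetical ICL: mean cosine distance between each visual query
    # and its corresponding textual query (row i pairs with row i).
    v = l2_normalize(visual_queries)
    t = l2_normalize(textual_queries)
    return float(np.mean(1.0 - np.sum(v * t, axis=-1)))

def intra_diversity_loss(queries):
    # Hypothetical IDL: penalize positive cosine similarity between
    # distinct queries within one modality, pushing them apart.
    q = l2_normalize(queries)
    sim = q @ q.T                        # pairwise cosine similarities
    n = sim.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]
    return float(np.mean(np.maximum(off_diag, 0.0)))
```

Under this sketch, perfectly aligned query pairs drive the ICL to zero, and mutually orthogonal queries drive the IDL to zero, matching the stated goals of concept alignment and discriminative concepts.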