The canonical approach to video-text retrieval leverages either a coarse-grained or a fine-grained alignment between visual and textual information. However, retrieving the correct video for a text query is often challenging, as it requires reasoning about both high-level (scene) and low-level (object) visual cues and how they relate to the query. To this end, we propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA. Specifically, our model captures cross-modal similarity information at different granularity levels. To alleviate the effect of irrelevant visual cues, we also apply an Interactive Similarity Aggregation (ISA) module, which weighs the importance of different visual features while aggregating the cross-modal similarity into a single similarity score for each granularity. Finally, we apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them, alleviating over- and under-representation issues at different levels. By jointly considering cross-modal similarity at different granularities, UCoFiA effectively unifies multi-grained alignments. Empirically, UCoFiA outperforms previous state-of-the-art CLIP-based methods on multiple video-text retrieval benchmarks, achieving 2.4%, 1.4%, and 1.3% improvements in text-to-video retrieval R@1 on MSR-VTT, Activity-Net, and DiDeMo, respectively. Our code is publicly available at https://github.com/Ziyang412/UCoFiA.
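The normalize-then-sum step described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository for that); the function names `sinkhorn_knopp` and `fuse_levels`, the iteration count, and the toy matrices are all hypothetical. It assumes each granularity yields a square text-by-video similarity matrix with strictly positive entries (for example, after exponentiation), and it uses the plain alternating row/column rescaling form of Sinkhorn-Knopp.

```python
from typing import List

Matrix = List[List[float]]


def sinkhorn_knopp(sim: Matrix, n_iters: int = 50) -> Matrix:
    """Alternately rescale rows and columns of a strictly positive matrix.

    For a square matrix with positive entries this converges toward a
    doubly stochastic matrix (every row and column sums to 1).
    """
    m = [row[:] for row in sim]  # work on a copy
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        for i in range(n_rows):
            row_sum = sum(m[i])
            m[i] = [v / row_sum for v in m[i]]
        # Column step: make every column sum to 1.
        for j in range(n_cols):
            col_sum = sum(m[i][j] for i in range(n_rows))
            for i in range(n_rows):
                m[i][j] /= col_sum
    return m


def fuse_levels(levels: Matrix and List[Matrix], n_iters: int = 50) -> Matrix:
    """Normalize each granularity's similarity matrix, then sum them."""
    normalized = [sinkhorn_knopp(level, n_iters) for level in levels]
    n_rows, n_cols = len(levels[0]), len(levels[0][0])
    return [
        [sum(mat[i][j] for mat in normalized) for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy example (made-up numbers): rows are text queries, columns are videos.
# In the coarse matrix, video 0 scores highly for both queries, so it would
# dominate retrieval if the raw scores were summed directly.
coarse = [[0.9, 0.2], [0.7, 0.3]]
fine = [[0.6, 0.4], [0.2, 0.8]]
fused = fuse_levels([coarse, fine])
```

After normalization, each video receives the same total similarity mass within a level, so no single video is over- or under-represented when the levels are added together. Retrieval for query `i` would then rank videos by `fused[i]`.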