Partially Relevant Video Retrieval (PRVR) seeks videos in which only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all other pairs as negatives, ignoring the rich semantic variation both within a single video and across different videos. As a result, the embeddings of queries describing distinct events in the same video, together with their corresponding clip segments, collapse onto one another, while embeddings of semantically similar queries and segments from different videos are driven apart. This degrades retrieval when videos contain multiple, diverse events. This paper addresses these problems, which we term semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships among text queries encoded by the foundation model. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. We further introduce order-preserving token merging and adaptive CBVA, which strengthen alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.
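The abstract does not spell out the form of the Text Correlation Preservation objective; one plausible reading is a relational loss that keeps the pairwise similarity structure of the trainable query embeddings close to that of the frozen foundation-model text encoder. The PyTorch sketch below illustrates this under that assumption; the name `tcp_loss` and the MSE formulation are hypothetical choices for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def tcp_loss(student_text: torch.Tensor, teacher_text: torch.Tensor) -> torch.Tensor:
    """Penalize drift between the pairwise query-similarity structure of the
    fine-tuned text embeddings and that of the frozen foundation model.

    student_text: (B, D) query embeddings from the trainable text branch.
    teacher_text: (B, D) embeddings of the same queries from the frozen
                  foundation-model text encoder (treated as a fixed target).
    """
    s = F.normalize(student_text, dim=-1)
    t = F.normalize(teacher_text, dim=-1).detach()
    # Pairwise cosine-similarity matrices over the batch of queries.
    sim_s = s @ s.t()
    sim_t = t @ t.t()
    # Match the relational structure; MSE on the similarity matrices is one
    # common choice (a KL over row-wise softmaxes is another).
    return F.mse_loss(sim_s, sim_t)
```

Preserving the similarity matrix, rather than the embeddings themselves, leaves the text branch free to adapt to the retrieval task while keeping semantically related queries from collapsing onto each other.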
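Similarly, "contrastive alignment across temporal scales" and "order-preserving token merging" admit a simple reading: a symmetric InfoNCE loss between paired segment embeddings from a coarse and a fine branch, and a greedy merge of the most similar *adjacent* frame tokens so that temporal order is never violated. The sketch below is a minimal illustration under those assumptions; `cbva_loss` and `merge_adjacent` are hypothetical names, and the specific loss and merging rules are not claimed to match the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cbva_loss(seg_coarse: torch.Tensor, seg_fine: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between segment embeddings from two branches
    operating at different temporal scales.

    seg_coarse, seg_fine: (N, D) paired segment embeddings; row i of each
    tensor describes the same temporal region of the same video.
    """
    a = F.normalize(seg_coarse, dim=-1)
    b = F.normalize(seg_fine, dim=-1)
    logits = a @ b.t() / temperature          # (N, N) cross-branch similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Matching segments across branches are positives; every other segment,
    # including other segments of the same video, acts as a negative, which
    # pushes distinct events apart instead of letting them collapse.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def merge_adjacent(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Greedy order-preserving token merging: repeatedly average the most
    similar adjacent pair, so segment boundaries never reorder in time.

    tokens: (T, D) frame tokens; returns (keep, D) merged tokens.
    """
    tokens = tokens.clone()
    while tokens.size(0) > keep:
        t = F.normalize(tokens, dim=-1)
        sim = (t[:-1] * t[1:]).sum(-1)        # similarity of each adjacent pair
        i = int(sim.argmax())
        merged = 0.5 * (tokens[i] + tokens[i + 1])
        tokens = torch.cat([tokens[:i], merged[None], tokens[i + 2:]], dim=0)
    return tokens
```

Restricting merges to adjacent tokens is what makes the operation order-preserving: redundant frames within one event are absorbed into a coherent segment, while dissimilar neighbors survive as boundaries between mutually distinctive segments.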