Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only part of their content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval process. In this setting, vague queries often induce semantic ambiguity across videos. This challenge is exacerbated by sparse temporal supervision within videos, which fails to provide sufficient matching evidence. To address this, we propose Holmes, a hierarchical evidential learning framework that aggregates multi-granular cross-modal evidence to explicitly quantify and model uncertainty. At the inter-video level, similarity scores are interpreted as evidential support and modeled via a Dirichlet distribution. Guided by a proposed three-fold principle, we perform fine-grained query identification, which in turn guides query-adaptive calibrated learning. At the intra-video level, to accumulate denser evidence, we formulate a soft query-clip alignment via flexible optimal transport with an adaptive dustbin, which alleviates sparse temporal supervision while suppressing spurious local responses. Extensive experiments demonstrate that Holmes outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICML26-Holmes.
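To make the two levels of evidence concrete, the following is a minimal sketch and not the released Holmes implementation. It assumes a standard subjective-logic reading of the Dirichlet model (evidence from a softplus of scaled similarities, uncertainty equal to the number of candidates divided by the Dirichlet strength) and a SuperGlue-style dustbin row and column for the optimal transport step. The function names, the `scale`, `eps`, and `dustbin_score` parameters, and the fixed (non-adaptive) dustbin score are all hypothetical choices made for illustration.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps evidence non-negative.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dirichlet_from_similarities(sims, scale=10.0):
    """Map query-to-video similarities to Dirichlet beliefs and uncertainty."""
    evidence = [softplus(scale * s) for s in sims]  # assumed evidence function
    alpha = [e + 1.0 for e in evidence]             # Dirichlet parameters
    strength = sum(alpha)
    belief = [e / strength for e in evidence]       # belief mass per video
    uncertainty = len(sims) / strength              # mass not backed by evidence
    return belief, uncertainty


def logsumexp(values):
    # Stable log(sum(exp(v))) for the log-domain Sinkhorn updates.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn_with_dustbin(scores, dustbin_score=0.0, eps=0.1, n_iters=100):
    """Entropic OT between query tokens (rows) and clips (columns).

    One extra dustbin row and column absorb mass that has no good match.
    The dustbin score is a fixed constant here (an assumption); an adaptive
    variant would predict or learn it.
    """
    m, n = len(scores), len(scores[0])
    # Augment the score matrix with a dustbin column, then a dustbin row.
    aug = [list(row) + [dustbin_score] for row in scores]
    aug.append([dustbin_score] * (n + 1))
    log_k = [[s / eps for s in row] for row in aug]
    # Unit mass per real item; each dustbin can absorb the whole other side.
    log_mu = [math.log(w / (m + n)) for w in [1.0] * m + [float(n)]]
    log_nu = [math.log(w / (m + n)) for w in [1.0] * n + [float(m)]]
    u = [0.0] * (m + 1)
    v = [0.0] * (n + 1)
    for _ in range(n_iters):
        u = [log_mu[i] - logsumexp([log_k[i][j] + v[j] for j in range(n + 1)])
             for i in range(m + 1)]
        v = [log_nu[j] - logsumexp([log_k[i][j] + u[i] for i in range(m + 1)])
             for j in range(n + 1)]
    # Transport plan; the last row and column are the dustbins.
    return [[math.exp(log_k[i][j] + u[i] + v[j]) for j in range(n + 1)]
            for i in range(m + 1)]


# Inter-video level: a query with one strong match vs. a vague query.
belief, u_clear = dirichlet_from_similarities([0.9, 0.1, 0.05])
_, u_vague = dirichlet_from_similarities([0.1, 0.1, 0.1])

# Intra-video level: one query token against four clips.
plan = sinkhorn_with_dustbin([[2.0, -1.0, -1.0, 1.5]])
```

Under these assumptions, the vague query receives a higher uncertainty than the query with one strong match, and clips with low query similarity send most of their mass to the dustbin row rather than to the query, which is the intended effect of suppressing spurious local responses.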