Video moment localization aims to retrieve the target segment of an untrimmed video given a natural language query. Weakly supervised methods have recently gained attention, as the precise temporal location of the target segment is not always available. However, one of the greatest challenges for weakly supervised methods lies in the mismatch between video and language induced by coarse temporal annotations. To refine the vision-language alignment, recent works reconstruct masked queries from positive and negative video proposals and contrast the resulting cross-modality similarities. However, the reconstruction may be influenced by a latent spurious correlation between the unmasked and masked parts of the query. This correlation distorts the restoration process and degrades the efficacy of contrastive learning, since the masked words are not reconstructed entirely from cross-modality knowledge. In this paper, we identify and mitigate this spurious correlation through a novel counterfactual cross-modality reasoning method. Specifically, we first formulate query reconstruction as an aggregated causal effect of cross-modality knowledge and query knowledge. Then, by introducing counterfactual cross-modality knowledge into this aggregation, we explicitly model the spurious contribution of the unmasked part to the reconstruction. Finally, by suppressing the unimodal effect of the masked query, we rectify the reconstructions of video proposals so that contrastive learning is performed on sound reconstructions. Extensive experimental evaluations demonstrate the effectiveness of our proposed method. The code is available at \href{https://github.com/sLdZ0306/CCR}{https://github.com/sLdZ0306/CCR}.
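The idea of suppressing the unimodal effect can be illustrated with a toy sketch. This is not the authors' implementation: the additive fusion in `fuse`, the zero vector standing in for counterfactual cross-modality knowledge, the weight `alpha`, and all numbers are assumptions chosen only to show why removing the query-only contribution may widen the gap between positive and negative proposals.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(cross_modal, query_only):
    """Hypothetical aggregation: element-wise sum of the two logit vectors."""
    return [c + q for c, q in zip(cross_modal, query_only)]

def debiased_logits(cross_modal, query_only, alpha=1.0):
    """Factual logits minus alpha times the counterfactual (query-only) logits."""
    factual = fuse(cross_modal, query_only)
    # Counterfactual branch: cross-modality input replaced by a constant (assumed zero).
    counterfactual = fuse([0.0] * len(cross_modal), query_only)
    return [f - alpha * c for f, c in zip(factual, counterfactual)]

def reconstruction_loss(logits, target):
    """Cross-entropy for reconstructing the masked word at index `target`."""
    return -math.log(softmax(logits)[target])

# Toy vocabulary of 3 words; the masked word has index 0.
target = 0
query_only = [3.0, 0.0, 0.0]    # unmasked words alone already favor word 0
pos_proposal = [2.0, 0.0, 0.0]  # matching proposal supports word 0
neg_proposal = [0.0, 0.0, 0.0]  # non-matching proposal is uninformative

# Loss gap (negative minus positive) without debiasing.
gap_factual = (reconstruction_loss(fuse(neg_proposal, query_only), target)
               - reconstruction_loss(fuse(pos_proposal, query_only), target))

# Loss gap after subtracting the query-only effect.
gap_debiased = (reconstruction_loss(debiased_logits(neg_proposal, query_only), target)
                - reconstruction_loss(debiased_logits(pos_proposal, query_only), target))

print(gap_factual, gap_debiased)
```

In this toy setting the query-only shortcut makes both proposals reconstruct the masked word almost equally well, so the factual loss gap is small. After the query-only effect is subtracted, the gap depends on the video evidence alone and becomes larger, which is the kind of signal contrastive learning relies on.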