Recent video grounding works attempt to introduce vanilla contrastive learning into video grounding. However, we claim that this naive solution is suboptimal. Contrastive learning requires two key properties: (1) \emph{alignment} of the features of similar samples, and (2) \emph{uniformity} of the induced distribution of the normalized features on the hypersphere. Video grounding, however, suffers from two issues: (1) some visual entities co-exist in both the ground-truth moment and other moments, \ie semantic overlapping; and (2) only a few moments in each video are annotated, \ie the sparse annotation dilemma. As a result, vanilla contrastive learning cannot model the correlations between temporally distant moments and learns inconsistent video representations, which makes it unsuitable for video grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a semantically aligned and uniform video grounding framework based on geodesic distance and game theory. We quantify the correlations among moments using the geodesic distance, which guides the model to learn correct cross-modal representations. Furthermore, from the novel perspective of game theory, we propose a semantic Shapley interaction based on geodesic distance sampling to learn fine-grained semantic alignment among similar moments. Experiments on three benchmarks demonstrate the effectiveness of our method.
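For reference, the alignment and uniformity properties named above are commonly quantified as follows. This is a sketch of a standard formulation from the contrastive representation learning literature, not necessarily the objective used in G2L. The encoder $f$ (with $\ell_2$-normalized outputs), the positive-pair distribution $p_{\mathrm{pos}}$, the data distribution $p_{\mathrm{data}}$, and the scale $t$ are notation assumed here for illustration.

```latex
% Assumed notation: f maps a sample to a unit-norm feature; t > 0 is a scale parameter.
\begin{align}
\mathcal{L}_{\mathrm{align}}(f)
  &= \mathbb{E}_{(x,\,y)\sim p_{\mathrm{pos}}}
     \left[ \lVert f(x) - f(y) \rVert_2^{2} \right], \\
\mathcal{L}_{\mathrm{uniform}}(f;\, t)
  &= \log \, \mathbb{E}_{x,\,y \,\overset{\text{i.i.d.}}{\sim}\, p_{\mathrm{data}}}
     \left[ e^{-t \lVert f(x) - f(y) \rVert_2^{2}} \right],
     \qquad t > 0 .
\end{align}
```

A lower alignment loss means that positive pairs are mapped to nearby features. A lower uniformity loss means that the features are spread more evenly over the hypersphere.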