Support-Set Based Cross-Supervision for Video Grounding

Current approaches to video grounding propose various complex architectures to capture video-text relations, and have achieved impressive improvements. However, it is hard to learn such complicated multi-modal relations through architecture design alone. In this paper, we introduce a novel Support-set Based Cross-Supervision (Sscs) module, which can improve existing methods during the training phase without extra inference cost. The proposed Sscs module contains two main components: a discriminative contrastive objective and a generative caption objective. The contrastive objective aims to learn effective representations by contrastive learning, while the caption objective trains a powerful video encoder under text supervision. Because some visual entities co-exist in both ground-truth and background intervals, i.e., mutual exclusion, naive contrastive learning is unsuitable for video grounding. We address this problem by boosting the cross-supervision with the support-set concept, which collects visual information from the whole video and eliminates the mutual exclusion of entities. Combined with the original objectives, Sscs enhances the multi-modal relation modeling ability of existing approaches. We extensively evaluate Sscs on three challenging datasets and show that our method improves current state-of-the-art methods by large margins, most notably by 6.35% in terms of R1@0.7 on Charades-STA.
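
The contrastive objective with a support set can be illustrated by a minimal sketch. The abstract does not specify how the support set is built or which loss is used, so everything below is an assumption made for illustration: the support-set feature is taken to be a text-weighted pooling over all clips of a video, the loss is a standard InfoNCE form, and the names `support_set_feature`, `contrastive_loss`, and `temperature` are hypothetical rather than taken from the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Scale a vector to unit length (an all-zero vector is left unchanged).
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def support_set_feature(clip_feats, text_feat, temperature=0.1):
    """Assumed pooling: weight ALL clips of the video by similarity to the text.

    Using every clip, rather than only the ground-truth interval, is meant to
    mirror the idea of collecting visual information from the whole video.
    """
    text = normalize(text_feat)
    clips = [normalize(c) for c in clip_feats]
    weights = softmax([dot(c, text) / temperature for c in clips])
    pooled = [
        sum(w * c[d] for w, c in zip(weights, clips)) for d in range(len(text))
    ]
    return normalize(pooled)


def contrastive_loss(videos, texts, temperature=0.1):
    """InfoNCE over a batch: text i should score highest with video i."""
    losses = []
    for i, text_feat in enumerate(texts):
        text = normalize(text_feat)
        logits = [
            dot(support_set_feature(clips, text, temperature), text) / temperature
            for clips in videos
        ]
        losses.append(-math.log(softmax(logits)[i]))
    return sum(losses) / len(losses)


# Toy batch: two videos with two clips each, and one query per video.
videos = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
texts = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss(videos, texts))
```

In this toy setup the loss is small when each query is paired with its own video and grows when the pairing is swapped, which is the behavior a contrastive objective is expected to have. Such a loss would be added to a grounding model's original objectives during training only, consistent with the claim of no extra inference cost.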
