Weakly-Supervised Video Moment Retrieval via Regularized Two-Branch Proposal Networks with Erasing Mechanism

Video moment retrieval aims to identify the target moment in an untrimmed video according to a given sentence. Because temporal boundary annotations are extremely time-consuming to acquire, the weakly-supervised setting, in which only video-sentence pairs are available during training, is attracting increasing attention. Most existing weakly-supervised methods adopt a MIL-based framework to develop inter-sample confrontment, but neglect the intra-sample confrontment between moments with similar semantics. As a result, these methods fail to distinguish the correct moment from plausible negative moments. Furthermore, previous attention models for cross-modal interaction tend to focus excessively on a few dominant words and ignore the comprehensive video-sentence correspondence. In this paper, we propose a novel Regularized Two-Branch Proposal Network with Erasing Mechanism that considers inter-sample and intra-sample confrontments simultaneously. Concretely, we first devise a language-aware visual filter to generate both enhanced and suppressed video streams. We then design a sharable two-branch proposal module that generates positive proposals from the enhanced branch and plausible negative proposals from the suppressed branch, providing sufficient confrontment. In addition, we introduce an attention-guided dynamic erasing mechanism in the enhanced branch to discover complementary video-sentence relations. Finally, we apply two types of proposal regularization to stabilize the training process and improve model performance. Extensive experiments on the ActivityCaption, Charades-STA and DiDeMo datasets show the effectiveness of our method.
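
To make the enhanced and suppressed streams concrete, the following is a minimal sketch of one way a language-aware filter could split frame features. It is not the paper's implementation: the function name `language_aware_filter`, the scaled dot-product relevance score, and the sigmoid gating are all assumptions chosen for illustration, and the abstract does not specify how the filter scores are computed.

```python
import math

def language_aware_filter(frames, sentence):
    """Illustrative split of frame features into two streams (assumed design).

    frames:   list of T frame feature vectors, each of length d
    sentence: a single sentence feature vector of length d
    Returns per-frame scores in (0, 1), the enhanced stream, and the
    suppressed stream.
    """
    scale = math.sqrt(len(sentence))
    # Assumed relevance score: scaled dot product between frame and sentence.
    logits = [sum(f * s for f, s in zip(frame, sentence)) / scale
              for frame in frames]
    # Assumed gating: a sigmoid squashes each score into (0, 1).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Enhanced stream keeps language-relevant frames.
    enhanced = [[w * f for f in frame] for w, frame in zip(scores, frames)]
    # Suppressed stream keeps the complementary, language-irrelevant part.
    suppressed = [[(1.0 - w) * f for f in frame]
                  for w, frame in zip(scores, frames)]
    return scores, enhanced, suppressed

# Toy input: three 2-dimensional frame features and one sentence feature.
frames = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
sentence = [2.0, 0.0]
scores, enhanced, suppressed = language_aware_filter(frames, sentence)
```

Under this assumed gating the two streams sum back to the original features, so a proposal module shared across both branches sees the same video weighted in complementary ways: the enhanced branch can supply positive proposals and the suppressed branch plausible negatives.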
