Given a long untrimmed video and a set of natural language queries, video grounding (VG) aims to temporally localize the video segment that semantically matches each query. Almost all existing VG work makes two simple but unrealistic assumptions: 1) every query sentence can be grounded in the corresponding video, and 2) all query sentences for the same video are at the same semantic scale. Both assumptions cause today's VG models to fail in practice. For example, in real-world multimodal assets (e.g., news articles), most sentences in the article cannot be grounded in the accompanying video, and the sentences typically have rich hierarchical relations (i.e., they are at different semantic scales). To this end, we propose a new and challenging grounding task: Weakly-Supervised temporal Article Grounding (WSAG). Specifically, given an article and a relevant video, WSAG aims to localize all "groundable" sentences in the video, where these sentences may be at different semantic scales. Accordingly, we collect the first WSAG dataset to facilitate this task: YouwikiHow, which draws on the inherent multi-scale descriptions in wikiHow articles and the abundance of YouTube videos. In addition, we propose a simple but effective method for WSAG, DualMIL, which consists of a two-level multiple instance learning (MIL) loss and a single-/cross-sentence constraint loss. These training objectives are carefully designed for the two relaxed assumptions. Extensive ablations verify the effectiveness of DualMIL.