Video grounding aims to localize the moment in an untrimmed video that corresponds to a given language query. Existing methods often address this task indirectly, by casting it as a proposal-and-match or fusion-and-detection problem. Solving these surrogate problems often requires sophisticated label assignment during training and hand-crafted removal of near-duplicate results. Meanwhile, existing works typically focus on sparse video grounding with a single sentence as input, which can lead to ambiguous localization when the description is unclear. In this paper, we tackle a new problem, dense video grounding, in which multiple moments are localized simultaneously given a paragraph as input. Viewing video grounding as language-conditioned regression, we present an end-to-end parallel decoding paradigm by re-purposing a Transformer-like architecture (PRVG). The key design in PRVG is to use language as queries and to directly regress the moment boundaries from language-modulated visual representations. Thanks to its simple design, the PRVG framework can be applied under different testing schemes (sparse or dense grounding) and allows for efficient inference without any post-processing. In addition, we devise a robust proposal-level attention loss to guide the training of PRVG; this loss is invariant to moment duration and contributes to model convergence. We perform experiments on two video grounding benchmarks, ActivityNet Captions and TACoS, demonstrating that PRVG can significantly outperform previous methods. We also perform in-depth studies to investigate the effectiveness of the parallel regression paradigm for video grounding.
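To make the idea of language-conditioned regression more concrete, the following is a minimal sketch, not the PRVG implementation. It assumes each sentence of the paragraph is already encoded as a feature vector and each video clip as a vector of the same dimension. The function name `ground_paragraph`, the single-head attention, the (center, width) parameterization, and the random weights `w_center` and `w_width` are all illustrative assumptions. The sketch also omits any interaction between sentence queries that a full Transformer-like decoder would likely include.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ground_paragraph(sentence_feats, clip_feats, w_center, w_width, duration):
    """Regress one (start, end) moment in seconds per sentence.

    Each sentence acts as a query that attends over the clip features;
    the attended (language-modulated) visual vector is mapped directly
    to a normalized center and width. No proposals are generated and no
    near-duplicate removal is needed, since there is one output per query.
    """
    dim = len(clip_feats[0])
    scale = math.sqrt(dim)
    moments = []
    # Queries are independent here, so this loop could be batched in parallel.
    for query in sentence_feats:
        attn = softmax([dot(query, clip) / scale for clip in clip_feats])
        context = [
            sum(a * clip[d] for a, clip in zip(attn, clip_feats))
            for d in range(dim)
        ]
        # Hypothetical linear regression heads squashed to (0, 1).
        center = sigmoid(dot(w_center, context))
        width = sigmoid(dot(w_width, context))
        start = max(0.0, center - width / 2) * duration
        end = min(1.0, center + width / 2) * duration
        moments.append((start, end))
    return moments


# Toy inputs: random features stand in for real video and text encoders.
random.seed(0)
dim, num_clips, num_sentences = 8, 16, 3
clips = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_clips)]
sentences = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_sentences)]
w_c = [random.gauss(0, 1) for _ in range(dim)]
w_w = [random.gauss(0, 1) for _ in range(dim)]

moments = ground_paragraph(sentences, clips, w_c, w_w, duration=120.0)
for i, (s, e) in enumerate(moments):
    print(f"sentence {i}: {s:.1f}s to {e:.1f}s")
```

Under these assumptions, sparse grounding is the special case of a one-sentence paragraph, and dense grounding passes all sentences at once; the same function serves both testing schemes.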