Recent DETR-based video grounding models learn moment queries to predict moment timestamps directly, without hand-crafted components such as pre-defined proposals or non-maximum suppression. However, their input-agnostic moment queries inevitably overlook the intrinsic temporal structure of a video and provide only limited positional information. In this paper, we formulate an event-aware dynamic moment query that enables the model to take the input-specific content and positional information of the video into account. To this end, we present two levels of reasoning: 1) event reasoning, which captures the distinctive event units constituting a given video using a slot attention mechanism; and 2) moment reasoning, which fuses the moment queries with a given sentence through a gated fusion transformer layer and learns interactions between the moment queries and video-sentence representations to predict moment timestamps. Extensive experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, which outperform state-of-the-art approaches on several video grounding benchmarks.
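The event reasoning step can be illustrated with a minimal NumPy sketch of slot attention over clip-level video features. This is not the authors' implementation: the function name `slot_attention`, the fixed random projections standing in for learned weights, the slot count, and the replacement of the usual recurrent slot update with a plain weighted mean are all assumptions made for brevity.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_attention(features, num_slots=4, num_iters=3, seed=0, eps=1e-8):
    """Simplified slot attention over clip features of shape (T, D).

    Returns (slots, attn) with shapes (num_slots, D) and (num_slots, T).
    Assumes num_iters >= 1.
    """
    rng = np.random.default_rng(seed)
    T, D = features.shape
    # Assumption: fixed random projections stand in for learned weights.
    W_q, W_k, W_v = (rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))
    k = features @ W_k
    v = features @ W_v
    # Slots start from random noise and are refined iteratively.
    slots = rng.normal(size=(num_slots, D))
    for _ in range(num_iters):
        q = slots @ W_q
        logits = q @ k.T / np.sqrt(D)  # (num_slots, T)
        # Softmax over the slot axis: slots compete for each clip.
        attn = softmax(logits, axis=0)
        # Normalise over clips so each slot takes a weighted mean of values.
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        # Assumption: direct assignment replaces the recurrent slot update.
        slots = weights @ v
    return slots, attn


# Toy input: 20 clips with 16-dimensional features.
features = np.random.default_rng(1).normal(size=(20, 16))
slots, attn = slot_attention(features)
# Index of the slot with the highest attention for each clip.
clip_to_slot = attn.argmax(axis=0)
```

Under these assumptions, `clip_to_slot` gives a rough assignment of clips to event units, and `slots` could serve as input-dependent content for the moment queries in place of input-agnostic ones.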