Recent work on video temporal grounding enforces strong cross-modal interaction through attention mechanisms to bridge the modality gap between the video and the text query. However, previous works treat all video clips equally in the attention modules, regardless of their semantic relevance to the text query. In this paper, our goal is to provide clues about query-associated video clips within the cross-modal encoding process. With our Correlation-Guided Detection Transformer (CG-DETR), we explore the appropriate clip-wise degree of cross-modal interaction and how to exploit such degrees for prediction. First, we design an adaptive cross-attention layer with dummy tokens. Dummy tokens conditioned on the text query take a portion of the attention weights, preventing irrelevant video clips from being represented by the text query. Yet, not all word tokens equally inherit the text query's correlation to video clips. Thus, we further guide the cross-attention map by inferring the fine-grained correlation between video clips and words. We enable this by learning a joint embedding space for high-level concepts, i.e., at the moment and sentence level, and inferring the clip-word correlation from it. Lastly, we use a moment-adaptive saliency detector to exploit each video clip's degree of text engagement. We validate the superiority of CG-DETR with state-of-the-art results on various benchmarks for both moment retrieval and highlight detection. Code is available at https://github.com/wjun0830/CGDETR.
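The adaptive cross-attention idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the released CG-DETR implementation. It assumes single-head attention with no learned projections, and it takes the dummy tokens as a plain input rather than conditioning them on the text query. The names `dummy_cross_attention`, `clips`, `words`, `dummies`, and `text_engagement` are hypothetical. The point it shows is that the softmax is normalized over words and dummy tokens together, so a clip unrelated to the query can place its attention mass on the dummies instead of being forced onto the words.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dummy_cross_attention(clips, words, dummies):
    """Cross-attention from video clips to text words with extra dummy keys.

    clips:   (L, d) clip features, used as attention queries
    words:   (N, d) word features, used as keys and values
    dummies: (D, d) dummy tokens, used as keys only (assumption of this sketch)

    Returns the text-attended clip features (L, d) and, per clip, the
    attention mass that landed on real words (L,).
    """
    d = clips.shape[-1]
    num_words = words.shape[0]

    # Normalize attention over words AND dummies jointly.
    keys = np.concatenate([words, dummies], axis=0)          # (N + D, d)
    weights = softmax(clips @ keys.T / np.sqrt(d), axis=-1)  # (L, N + D)

    # Keep only the word part; the mass absorbed by dummies is discarded,
    # so each row of word_weights sums to less than one.
    word_weights = weights[:, :num_words]                    # (L, N)
    attended = word_weights @ words                          # (L, d)
    text_engagement = word_weights.sum(axis=-1)              # (L,)
    return attended, text_engagement


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clips = rng.normal(size=(6, 8))    # 6 clips, feature dimension 8
    words = rng.normal(size=(4, 8))    # 4 query words
    dummies = rng.normal(size=(2, 8))  # 2 dummy tokens
    attended, engagement = dummy_cross_attention(clips, words, dummies)
    print(attended.shape, engagement.shape)
```

In this sketch, `text_engagement` is a rough stand-in for the clip-wise degree of cross-modal interaction mentioned in the abstract: a clip whose features align with the dummy tokens receives a value near zero and is barely modified by the text.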