Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks: MR aims to localize the moments in a video that are relevant to a query, and HD aims to predict a highlight score for each video clip. Recently, several methods have built DETR-based networks to solve MR and HD jointly. These methods simply attach two separate task heads after multi-modal feature extraction and feature interaction, and achieve good performance. Nevertheless, they underutilize the reciprocal relationship between the two tasks. In this paper, we propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD. Specifically, a local-global multi-modal alignment module is first built to align features from diverse modalities into a shared latent space. Subsequently, a visual feature refinement module is designed to eliminate query-irrelevant information from visual features to facilitate modal interaction. Finally, a task cooperation module is constructed to refine the retrieval pipeline and the highlight score prediction process by utilizing the reciprocity between MR and HD. Comprehensive experiments on the QVHighlights, Charades-STA, and TVSum datasets demonstrate that TR-DETR outperforms existing state-of-the-art methods. Code is available at \url{https://github.com/mingyao1120/TR-DETR}.