Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks: MR aims to localize the moments in a video that are relevant to a query, and HD aims to predict a highlight score for each video clip. Recently, several methods have built DETR-based networks to solve MR and HD jointly. These methods simply add two separate task heads after multi-modal feature extraction and feature interaction, and they achieve good performance. Nevertheless, they underutilize the reciprocal relationship between the two tasks. In this paper, we propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD. Specifically, a local-global multi-modal alignment module is first built to align features from diverse modalities into a shared latent space. Subsequently, a visual feature refinement module is designed to eliminate query-irrelevant information from visual features before modal interaction. Finally, a task cooperation module is constructed to refine the retrieval pipeline and the highlight score prediction process by utilizing the reciprocity between MR and HD. Comprehensive experiments on the QVHighlights, Charades-STA and TVSum datasets demonstrate that TR-DETR outperforms existing state-of-the-art methods. Code is available at \url{https://github.com/mingyao1120/TR-DETR}.
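To make the relationship between the two task outputs concrete, the following is a toy sketch in plain Python. It is not the TR-DETR architecture or any module from the paper. It only illustrates, under stated assumptions, that per-clip highlight scores (the HD output) and moment spans (the MR output) describe overlapping information and can each be roughly derived from the other. The function names, the fixed threshold and the 2-second clip length are all hypothetical choices made for this illustration.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start_seconds, end_seconds)


def moments_from_scores(scores: List[float], threshold: float,
                        clip_len: float = 2.0) -> List[Moment]:
    """Group consecutive clips scoring >= threshold into (start, end) spans.

    Hypothetical helper: a crude way to read moments off highlight scores.
    clip_len is an assumed clip duration in seconds.
    """
    moments: List[Moment] = []
    start = None  # index of the first clip in the current run, if any
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            moments.append((start * clip_len, i * clip_len))
            start = None
    if start is not None:  # a run that lasts until the final clip
        moments.append((start * clip_len, len(scores) * clip_len))
    return moments


def scores_from_moments(moments: List[Moment], num_clips: int,
                        clip_len: float = 2.0) -> List[float]:
    """Give a clip score 1.0 if its centre lies inside any moment, else 0.0.

    Hypothetical helper: a crude way to read clip relevance off moments.
    """
    scores = []
    for i in range(num_clips):
        centre = (i + 0.5) * clip_len
        inside = any(start <= centre < end for start, end in moments)
        scores.append(1.0 if inside else 0.0)
    return scores


if __name__ == "__main__":
    clip_scores = [0.1, 0.8, 0.9, 0.2, 0.7]  # made-up highlight scores
    spans = moments_from_scores(clip_scores, threshold=0.5)
    print(spans)
    print(scores_from_moments(spans, num_clips=len(clip_scores)))
```

In a learned model this coupling would be handled by trained modules rather than a hand-set threshold. The sketch only shows why treating the two heads as fully independent leaves shared information unused.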