Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify events described by open-ended natural language within untrimmed videos. Although the two tasks focus on different events, we observe a significant connection between them. For instance, most descriptions in MR involve multiple actions from TAD. In this paper, we aim to investigate the potential synergy between TAD and MR. Firstly, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR. It transforms the inputs of the two tasks, namely actions for TAD or events for MR, into a common embedding space, and utilizes two novel query-dependent decoders to generate a uniform output of classification scores and temporal segments. Secondly, we explore the efficacy of two task fusion learning approaches, pre-training and co-training, in order to enhance the mutual benefits between TAD and MR. Extensive experiments demonstrate that the proposed task fusion learning scheme enables the two tasks to help each other and outperform their separately trained counterparts. Impressively, UniMD achieves state-of-the-art results on three paired datasets: Ego4D, Charades-STA, and ActivityNet. Our code is available at https://github.com/yingsen1/UniMD.
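The unified interface described above — text queries from either task (a pre-defined action name for TAD, an open-ended sentence for MR) mapped into a common embedding space and decoded into classification scores plus temporal segments — can be sketched as follows. This is only an illustrative stand-in, not the actual UniMD model: the dimensions, the hash-seeded "encoder", and the linear heads are hypothetical placeholders for the paper's learned text encoder and query-dependent decoders.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64   # shared embedding dimension (hypothetical)
T = 100  # number of video snippets (hypothetical)

def embed_query(text: str, dim: int = D) -> np.ndarray:
    """Stand-in text encoder: a deterministic per-string random unit vector.

    UniMD would use a learned text encoder here; this only illustrates
    that TAD actions and MR sentences share one embedding space.
    """
    g = np.random.default_rng(abs(hash(text)) % (2**32))
    v = g.normal(size=dim)
    return v / np.linalg.norm(v)

# Per-snippet video features (random stand-ins for a video backbone).
video_feats = rng.normal(size=(T, D))

# Fixed linear localization head (stand-in for a learned decoder).
W_loc = rng.normal(size=(D, 2)) * 0.1

def query_dependent_decode(query_emb: np.ndarray, feats: np.ndarray):
    """Sketch of the two query-dependent heads with a uniform output format.

    Returns per-snippet classification scores in [0, 1] and per-snippet
    (start, end) offset predictions, for any query from either task.
    """
    # Classification branch: relevance of each snippet to the query.
    cls_scores = 1.0 / (1.0 + np.exp(-(feats @ query_emb)))      # shape (T,)
    # Localization branch: query-modulated features -> segment offsets.
    offsets = np.abs((feats * query_emb) @ W_loc)                # shape (T, 2)
    return cls_scores, offsets

# A TAD-style query and an MR-style query go through the same pipeline
# and yield the same kind of output: a score and a temporal segment.
for q in ["open a door", "a person opens the door then walks out"]:
    scores, offs = query_dependent_decode(embed_query(q), video_feats)
    t = int(scores.argmax())
    segment = (t - offs[t, 0], t + offs[t, 1])
```

The point of the sketch is the shared output contract: both tasks reduce to "score each moment against a text query, then regress its boundaries", which is what makes joint pre-training or co-training of TAD and MR possible.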