As video understanding has advanced, tasks for clip-level temporal video analysis have proliferated, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD). Task-specific video understanding models perform strongly on their individual tasks, but a unified framework that can address multiple tasks simultaneously is still lacking, even though it is a promising direction for the next generation of AI. To this end, we propose a single unified framework, called Temporal2Seq, that formulates the outputs of these temporal video understanding tasks as sequences of discrete tokens. With this unified token representation, Temporal2Seq can train a generalist model on different video understanding tasks within a single architecture. Because no multi-task learning (MTL) benchmark exists for these tasks, we compile a comprehensive co-training dataset from existing TAD, TAS, and GEBD datasets. We evaluate the Temporal2Seq generalist model on the corresponding test sets of the three tasks and show that it produces reasonable results on each task and offers advantages over single-task training within the same framework. We also investigate how the generalist model generalizes to new datasets from different tasks, where it performs better than the task-specific models.
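The abstract does not specify how task outputs are turned into tokens. As a rough illustration only, the sketch below assumes one plausible scheme: timestamps are quantized into a fixed number of bins, and class labels are mapped to token ids placed after the time bins. The names `NUM_BINS`, `quantize`, `segments_to_tokens`, and `boundaries_to_tokens` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical vocabulary layout (an assumption, not the paper's design):
# ids [0, NUM_BINS) are quantized time bins, ids >= NUM_BINS are class labels.
NUM_BINS = 100


def quantize(t: float, duration: float, num_bins: int = NUM_BINS) -> int:
    """Map a timestamp in [0, duration] to an integer bin in [0, num_bins - 1]."""
    frac = min(max(t / duration, 0.0), 1.0)  # clamp to the clip extent
    return min(int(frac * num_bins), num_bins - 1)


def segments_to_tokens(
    segments: List[Tuple[float, float, int]], duration: float
) -> List[int]:
    """Encode (start, end, class_id) segments, as in TAD or TAS, as one flat token list."""
    tokens: List[int] = []
    for start, end, class_id in segments:
        tokens += [
            quantize(start, duration),
            quantize(end, duration),
            NUM_BINS + class_id,  # class tokens live after the time bins
        ]
    return tokens


def boundaries_to_tokens(boundaries: List[float], duration: float) -> List[int]:
    """Encode class-agnostic event boundaries, as in GEBD, as time-bin tokens."""
    return [quantize(b, duration) for b in boundaries]
```

Under these assumptions, detection, segmentation, and boundary outputs all share a single integer vocabulary, which is the property that would let one sequence decoder be trained on all three tasks.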