With the development of video understanding, tasks for clip-level temporal video analysis have proliferated, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD). While task-specific video understanding models have achieved outstanding performance on each of these tasks, a unified framework capable of addressing multiple tasks simultaneously is still lacking, and building one is a promising direction for the next generation of AI. To this end, in this paper we propose a single unified framework, coined Temporal2Seq, which formulates the outputs of these temporal video understanding tasks as sequences of discrete tokens. With this unified token representation, Temporal2Seq can train a generalist model within a single architecture across different video understanding tasks. In the absence of multi-task learning (MTL) benchmarks, we compile a comprehensive co-training dataset from existing TAD, TAS, and GEBD datasets. We evaluate our Temporal2Seq generalist model on the corresponding test sets of the three tasks, demonstrating that it produces reasonable results on each task and offers advantages over single-task training within this framework. We also investigate the generalization of our generalist model to new datasets from different tasks, where it outperforms task-specific models.