Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences. These tasks differ in their inputs (video only, or a video-query pair in which the query is an image region or a sentence) and in their outputs (temporal segments or spatio-temporal tubes). At their core, however, they require the same fundamental understanding of the video, i.e., of the actors and objects in it and of their actions and interactions. So far these tasks have been tackled in isolation with individual, highly specialized architectures that do not exploit the interplay between tasks. In contrast, in this paper we present a single, unified model for query-based video understanding in long-form videos. In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark, which entail queries of three different forms: given an egocentric video and a visual, textual or activity query, the goal is to determine when and where the answer can be seen within the video. Our model design is inspired by recent query-based approaches to spatio-temporal grounding, and contains modality-specific query encoders and task-specific sliding-window inference that allow multi-task training with diverse input modalities and differently structured outputs. We exhaustively analyze the relationships among the tasks and show that cross-task learning improves performance on each individual task and enables generalization to unseen tasks, such as zero-shot spatial localization of language queries.
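The abstract mentions sliding-window inference over long-form videos but does not give its details. The following is a minimal sketch of the general idea only, under stated assumptions: the function names, the `(start, end, score)` segment format, the hypothetical `predict_window` callback standing in for a model, and the use of greedy temporal non-maximum suppression to merge per-window predictions are all illustrative choices and are not taken from the paper.

```python
from typing import Callable, List, Tuple

# Assumed segment format: (start_frame, end_frame, confidence score).
Segment = Tuple[float, float, float]


def sliding_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges that together cover the whole video."""
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)  # extra window so the tail is covered
    return [(s, s + window) for s in starts]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def infer_long_video(
    num_frames: int,
    window: int,
    stride: int,
    predict_window: Callable[[int, int], List[Segment]],  # hypothetical model call
    iou_threshold: float = 0.5,
) -> List[Segment]:
    """Run a per-window predictor over a long video and merge its outputs.

    `predict_window(start, end)` is assumed to return segments in
    window-local frame coordinates; they are shifted to video coordinates
    here and de-duplicated with greedy temporal NMS (an assumed merge rule).
    """
    candidates: List[Segment] = []
    for start, end in sliding_windows(num_frames, window, stride):
        for s, e, score in predict_window(start, end):
            candidates.append((s + start, e + start, score))

    # Greedy NMS: keep the highest-scoring segment, drop heavy overlaps.
    candidates.sort(key=lambda seg: seg[2], reverse=True)
    kept: List[Segment] = []
    for seg in candidates:
        if all(temporal_iou(seg, k) < iou_threshold for k in kept):
            kept.append(seg)
    return kept
```

In a multi-task setting of the kind the abstract describes, one would presumably vary the window length, stride, and merge rule per task (for example, temporal segments for language and activity queries versus spatio-temporal tubes for visual queries); those choices are left open here because the abstract does not specify them.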