We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic action of a specific individual, guided by text. To explore this task, we present the RefAVA dataset, containing 36,630 instances with manually annotated textual descriptions of the individuals. To establish a strong initial benchmark, we implement and validate baselines from various domains, e.g., atomic action localization, video question answering, and text-video retrieval. Since these existing methods underperform on RAVAR, we introduce RefAtomNet -- a novel cross-stream attention-driven method specialized for the unique challenges of RAVAR: interpreting the textual referring expression for the targeted individual, using this reference to guide spatial localization, and predicting the atomic actions of the referred person. The key ingredients are: (1) a multi-stream architecture that connects video, text, and a new location-semantic stream, and (2) cross-stream agent attention fusion and agent token fusion, which amplify the most relevant information across these streams and consistently surpass standard attention-based fusion on RAVAR. Extensive experiments demonstrate the effectiveness of RefAtomNet and its building blocks for recognizing the action of the described individual. The dataset and code will be made publicly available at https://github.com/KPeng9510/RAVAR.
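To make the cross-stream fusion idea concrete, the following is a minimal PyTorch sketch of agent-based cross-stream attention under standard agent-attention assumptions: a small set of learnable agent tokens first summarizes one stream (e.g., text or location-semantic tokens), and the other stream (e.g., video tokens) then attends to that compact summary. All names (CrossStreamAgentAttention, num_agents) and shapes are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossStreamAgentAttention(nn.Module):
    """Sketch of agent attention across two streams (illustrative, not
    the authors' exact RefAtomNet module)."""

    def __init__(self, dim: int, num_agents: int = 8):
        super().__init__()
        # Learnable agent tokens that act as a compact bottleneck.
        self.agents = nn.Parameter(torch.randn(num_agents, dim) * 0.02)
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query_stream: torch.Tensor, kv_stream: torch.Tensor) -> torch.Tensor:
        # query_stream: (B, Nq, D), e.g., video tokens
        # kv_stream:    (B, Nk, D), e.g., text or location-semantic tokens
        B = query_stream.size(0)
        q = self.q_proj(query_stream)
        k = self.k_proj(kv_stream)
        v = self.v_proj(kv_stream)
        a = self.agents.unsqueeze(0).expand(B, -1, -1)        # (B, M, D)

        # Step 1: agent tokens aggregate the key/value stream.
        agent_attn = F.softmax(a @ k.transpose(1, 2) * self.scale, dim=-1)
        agent_vals = agent_attn @ v                            # (B, M, D)

        # Step 2: query tokens read from the compact agent summary,
        # amplifying the most relevant cross-stream information.
        token_attn = F.softmax(q @ a.transpose(1, 2) * self.scale, dim=-1)
        return token_attn @ agent_vals                         # (B, Nq, D)

# Usage sketch: fuse video tokens with referring-expression tokens.
fusion = CrossStreamAgentAttention(dim=256)
video_tokens = torch.randn(2, 196, 256)
text_tokens = torch.randn(2, 32, 256)
fused = fusion(video_tokens, text_tokens)                      # (2, 196, 256)
```

Routing both streams through a small agent bottleneck, rather than full token-to-token cross-attention, is one plausible reading of why such fusion can filter out irrelevant tokens and surpass standard attention-based fusion on this task.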