Existing action recognition methods are typically actor-specific because actors differ intrinsically in topology and appearance. They therefore require actor-specific pose estimation (e.g., for humans vs. animals), which complicates model design and raises maintenance costs. Moreover, they often learn from the visual modality alone and perform single-label classification, neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition,' which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), which leverages both visual and textual modalities to better represent the action classes. Eliminating actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that MSQNet consistently outperforms prior actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code will be released at https://github.com/mondalanindya/MSQNet.