Video Action Recognition (VAR) remains a challenging task: although many approaches have been explored in the literature, designing a unified framework that recognizes a large number of human actions is still an open problem. Recently, Multi-Modal Learning (MML) has demonstrated promising results in this domain. In the literature, the 2D skeleton or pose modality has often been used for this task, either independently or in conjunction with the visual information (RGB modality) present in videos. However, the combination of pose, visual information, and text attributes has not yet been explored, even though text and pose attributes have each proven effective in numerous computer vision tasks. In this paper, we present the first pose-augmented vision-language model (VLM) for VAR. Notably, our scheme achieves accuracies of 92.81% and 73.02% on two popular human video action recognition benchmark datasets, UCF-101 and HMDB-51, respectively, without any video data pre-training, and accuracies of 96.11% and 75.75% after Kinetics pre-training.