Human perception of surroundings is often guided by the various poses present within the environment. Many computer vision tasks, such as human action recognition and robot imitation learning, rely on pose-based entities like human skeletons or robotic arms. However, conventional Vision Transformer (ViT) models process all patches uniformly, neglecting valuable pose priors in input videos. We argue that incorporating poses into RGB data is advantageous for learning fine-grained and viewpoint-agnostic representations. Consequently, we introduce two strategies for learning pose-aware representations in ViTs. The first method, called Pose-aware Attention Block (PAAB), is a plug-and-play ViT block that performs localized attention on pose regions within videos. The second method, dubbed Pose-Aware Auxiliary Task (PAAT), presents an auxiliary pose prediction task optimized jointly with the primary ViT task. Although their functionalities differ, both methods succeed in learning pose-aware representations, enhancing performance in multiple diverse downstream tasks. Our experiments, conducted across seven datasets, demonstrate the efficacy of both pose-aware methods on three video analysis tasks, with PAAT holding a slight edge over PAAB. Both PAAT and PAAB surpass their respective backbone Transformers by up to 9.8% in real-world action recognition and by up to 21.8% in multi-view robotic video alignment. Code is available at https://github.com/dominickrei/PoseAwareVT.
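To make the two ideas concrete, the following is a minimal illustrative sketch and not the released implementation. It assumes a boolean per-patch pose mask is already available, uses the patch tokens directly as queries, keys, and values (no learned projections), and leaves non-pose patches unchanged. The names `pose_local_attention`, `joint_loss`, and `pose_weight` are hypothetical, and the weighted-sum form of the joint objective is an assumption rather than something stated in the abstract.

```python
import math


def pose_local_attention(tokens, is_pose_patch):
    """Self-attention restricted to patches flagged as pose regions.

    tokens: list of equal-length float lists, one per patch.
    is_pose_patch: list of bools, True where a patch overlaps a pose.
    Assumption: tokens act as queries, keys, and values directly, and
    patches outside pose regions pass through unchanged.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    pose_idx = [i for i, flag in enumerate(is_pose_patch) if flag]
    outputs = [list(t) for t in tokens]  # default: copy every token as-is
    for i in pose_idx:
        # Scaled dot-product scores against pose patches only.
        scores = [scale * sum(a * b for a, b in zip(tokens[i], tokens[j]))
                  for j in pose_idx]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        outputs[i] = [
            sum(w * tokens[j][d] for w, j in zip(weights, pose_idx)) / norm
            for d in range(dim)
        ]
    return outputs


def joint_loss(primary_loss, pose_loss, pose_weight=1.0):
    """Primary task loss plus a weighted auxiliary pose-prediction loss.

    Assumption: the two terms are combined as a simple weighted sum.
    """
    return primary_loss + pose_weight * pose_loss


# Three patches; the first two are assumed to lie on a pose region.
tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mixed = pose_local_attention(tokens, [True, True, False])
total = joint_loss(primary_loss=2.0, pose_loss=0.5, pose_weight=0.5)
```

In this sketch the third patch is returned untouched, while each pose patch becomes a softmax-weighted mixture of the pose patches only, which is one simple way to read "localized attention on pose regions".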