While human action recognition has witnessed notable achievements, multimodal methods fusing RGB and skeleton modalities still suffer from the inherent heterogeneity of the two modalities and fail to fully exploit their complementary potential. In this paper, we propose PAN, the first human-centric graph representation learning framework for multimodal action recognition, in which token embeddings of RGB patches containing human joints are represented as spatiotemporal graphs. This human-centric graph modeling paradigm suppresses the redundancy in RGB frames and aligns well with skeleton-based methods, thus enabling a more effective and semantically coherent fusion of multimodal features. Since the sampling of token embeddings relies heavily on 2D skeletal data, we further propose attention-based post-calibration to reduce the dependency on high-quality skeletal data at a minimal cost in terms of model performance. To explore the potential of PAN in integrating with skeleton-based methods, we present two variants: PAN-Ensemble, which employs dual-path graph convolutional networks followed by late fusion, and PAN-Unified, which performs unified graph representation learning within a single network. On three widely used multimodal action recognition datasets, PAN-Ensemble and PAN-Unified achieve state-of-the-art (SOTA) performance in their respective multimodal fusion settings: separate and unified modeling.
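To make the human-centric sampling idea concrete, the sketch below selects the ViT-style patch token lying under each 2D joint and wires the resulting nodes into a spatiotemporal graph (bone edges within a frame, same-joint edges across frames). This is a minimal illustration, not the paper's implementation: the patch size, the toy skeleton topology, and the function names `sample_joint_tokens` and `build_st_edges` are all assumptions introduced here.

```python
# Minimal sketch (not the authors' code) of joint-guided patch-token
# sampling. Assumes a ViT-style patch grid with 16x16-pixel patches
# and 2D joints given in pixel coordinates; the 5-joint skeleton and
# all names below are illustrative assumptions.
import numpy as np

PATCH = 16                      # assumed patch size (pixels)
SKELETON_EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]  # toy bone list

def sample_joint_tokens(tokens, joints_2d, img_w):
    """Pick the patch-token embedding under each 2D joint.

    tokens:    (T, N, C) patch embeddings per frame (N = grid tokens)
    joints_2d: (T, J, 2) joint (x, y) pixel coordinates
    img_w:     frame width in pixels (to flatten grid indices)
    """
    grid_w = img_w // PATCH
    cols = (joints_2d[..., 0] // PATCH).astype(int)   # (T, J)
    rows = (joints_2d[..., 1] // PATCH).astype(int)
    idx = rows * grid_w + cols                        # flat token index
    t = np.arange(tokens.shape[0])[:, None]           # (T, 1)
    return tokens[t, idx]                             # (T, J, C) nodes

def build_st_edges(T, J):
    """Spatial bone edges per frame plus same-joint temporal edges."""
    edges = []
    for t in range(T):
        edges += [(t * J + a, t * J + b) for a, b in SKELETON_EDGES]
        if t + 1 < T:
            edges += [(t * J + j, (t + 1) * J + j) for j in range(J)]
    return edges

# Toy usage: 4 frames, 14x14 token grid, 5 joints, 64-dim embeddings.
T, J, C, img_w = 4, 5, 64, 224
tokens = np.random.randn(T, (img_w // PATCH) ** 2, C)
joints = np.random.rand(T, J, 2) * img_w
nodes = sample_joint_tokens(tokens, joints, img_w)    # (4, 5, 64)
edges = build_st_edges(T, J)                          # graph connectivity
```

Under these assumptions, `nodes` and `edges` form the RGB-side spatiotemporal graph that a graph convolutional network can then process alongside the skeleton graph.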