Multimodal action recognition methods have achieved considerable success using the pose and RGB modalities. However, skeleton sequences lack appearance information, and RGB images suffer from irrelevant noise; both are limitations inherent to the respective modality. To address this, we introduce the human parsing feature map as a novel modality, since it selectively retains effective semantic features of the body parts while filtering out most irrelevant noise. We propose a new dual-branch framework called Ensemble Human Parsing and Pose Network (EPP-Net), which is the first to leverage both the skeleton and human parsing modalities for action recognition. The first branch, the human pose branch, feeds robust skeletons into a graph convolutional network to model pose features, while the second branch, the human parsing branch, feeds depictive parsing feature maps into convolutional backbones to model parsing features. The two high-level features are combined through a late fusion strategy for better action recognition. Extensive experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of the proposed EPP-Net, which outperforms existing action recognition methods. Our code is available at: https://github.com/liujf69/EPP-Net-Action.
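The late fusion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the two branches are fused, so this example assumes one common choice: each branch produces per-class logits, these are converted to probabilities with a softmax, and the two probability vectors are averaged with a weight. The function names (`softmax`, `late_fusion`, `predict`) and the `pose_weight` parameter are hypothetical and are not taken from the EPP-Net code.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(pose_logits, parsing_logits, pose_weight=0.5):
    """Fuse two branches by a weighted average of their class probabilities.

    Assumption: score-level fusion with a single scalar weight. The weight
    value is a hypothetical hyperparameter, not one reported by the authors.
    """
    if len(pose_logits) != len(parsing_logits):
        raise ValueError("both branches must score the same set of classes")
    if not 0.0 <= pose_weight <= 1.0:
        raise ValueError("pose_weight must lie in [0, 1]")

    pose_probs = softmax(pose_logits)
    parsing_probs = softmax(parsing_logits)
    return [
        pose_weight * p + (1.0 - pose_weight) * q
        for p, q in zip(pose_probs, parsing_probs)
    ]


def predict(fused_probs):
    """Return the index of the class with the highest fused probability."""
    return max(range(len(fused_probs)), key=fused_probs.__getitem__)


if __name__ == "__main__":
    # Made-up logits for three action classes, for illustration only.
    pose_branch = [2.0, 0.5, 0.1]
    parsing_branch = [0.2, 0.3, 3.0]
    fused = late_fusion(pose_branch, parsing_branch, pose_weight=0.5)
    print("fused probabilities:", fused)
    print("predicted class index:", predict(fused))
```

Because fusion happens on the outputs, each branch can be trained and evaluated independently, and the weight can be tuned on a validation set without retraining either branch.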