Human action recognition aims to classify the category of a human action from a video segment. Recent work has focused on designing GCN-based models that extract features from skeletons for this task, because skeleton representations are much more efficient and robust than other modalities such as RGB frames. However, skeleton data also discards some important cues, such as related objects. As a result, some actions become ambiguous: they are hard to distinguish and tend to be misclassified. To alleviate this problem, we propose an auxiliary feature refinement head (FR Head), which consists of spatial-temporal decoupling and contrastive feature refinement, to obtain discriminative representations of skeletons. Ambiguous samples are dynamically discovered and calibrated in the feature space. Furthermore, FR Head can be imposed on different stages of GCNs to build multi-level refinement for stronger supervision. Extensive experiments are conducted on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets. Our proposed models achieve results competitive with state-of-the-art methods and help to discriminate those ambiguous samples. Code is available at https://github.com/zhysora/FR-Head.