Unsupervised Spatial-Temporal Feature Enrichment and Fidelity Preservation Network for Skeleton based Action Recognition

Unsupervised skeleton-based action recognition has achieved remarkable progress recently. However, existing unsupervised learning methods suffer from a severe overfitting problem, so small networks are used, which significantly reduces the representation capability. To address this problem, the overfitting mechanism behind unsupervised learning for skeleton-based action recognition is first investigated. It is observed that the skeleton is already a relatively high-level and low-dimensional feature, but it does not lie on the same manifold as the features needed for action recognition. Simply applying existing unsupervised learning methods may produce features that discriminate between individual samples rather than action classes, resulting in overfitting. To solve this problem, this paper presents an Unsupervised spatial-temporal Feature Enrichment and Fidelity Preservation framework (U-FEFP) that generates rich distributed features containing all the information of the skeleton sequence. A spatial-temporal feature transformation subnetwork is developed using a spatial-temporal graph convolutional network and a graph convolutional gated recurrent unit network as the basic feature extraction network. Unsupervised Bootstrap Your Own Latent (BYOL) based learning is used to generate rich distributed features, and unsupervised pretext-task-based learning is used to preserve the information of the skeleton sequence. The two unsupervised learning schemes are combined in U-FEFP to produce robust and discriminative representations. Experimental results on three widely used benchmarks, namely the NTU-RGB+D-60, NTU-RGB+D-120 and PKU-MMD datasets, demonstrate that the proposed U-FEFP achieves the best performance compared with state-of-the-art unsupervised learning methods. t-SNE visualizations further validate that U-FEFP learns more discriminative features for unsupervised skeleton-based action recognition.
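As a rough illustration of how a BYOL-style objective and a pretext-task objective could be combined into one training loss, consider the minimal sketch below. It is not the authors' implementation. The abstract does not name the pretext task or say how the two terms are balanced, so the sketch assumes a plain reconstruction task (mean squared error between a decoded sequence and the input skeleton) and a hypothetical scalar `weight`. All function names are illustrative. The BYOL term follows the standard formulation: a normalized regression loss between the online network's prediction and the target network's projection, with the target parameters updated as an exponential moving average of the online parameters.

```python
import math

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def byol_loss(online_preds, target_projs):
    """Standard BYOL regression loss, 2 - 2 * cosine, averaged over the batch."""
    losses = [2.0 - 2.0 * cosine(p, z) for p, z in zip(online_preds, target_projs)]
    return sum(losses) / len(losses)

def reconstruction_loss(decoded, skeleton):
    """Assumed pretext task: mean squared error over flattened joint coordinates."""
    return sum((d - s) ** 2 for d, s in zip(decoded, skeleton)) / len(skeleton)

def combined_loss(online_preds, target_projs, decoded, skeleton, weight=1.0):
    """Feature-enrichment term plus a weighted fidelity term (weight is hypothetical)."""
    return byol_loss(online_preds, target_projs) + weight * reconstruction_loss(decoded, skeleton)

def ema_update(target_params, online_params, tau=0.99):
    """BYOL target update: exponential moving average of the online parameters."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]
```

In this sketch the BYOL term pulls together the features of two augmented views of the same sequence, while the reconstruction term penalizes features from which the input skeleton can no longer be recovered. This is one possible reading of "feature enrichment" and "fidelity preservation" under the assumptions stated above.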
