Most recent view-invariant action recognition and performance assessment approaches rely on a large amount of annotated 3D skeleton data to extract view-invariant features. However, acquiring 3D skeleton data can be cumbersome, if not impractical, in in-the-wild scenarios. To overcome this problem, we present a novel unsupervised approach that learns to extract view-invariant 3D human pose representations from 2D images without using 3D joint data. Our model is trained by exploiting the intrinsic view-invariant properties of human pose between simultaneous frames from different viewpoints, and its equivariant properties between augmented frames from the same viewpoint. We evaluate the learned view-invariant pose representations on two downstream tasks. Our comparative experiments show that we improve on the state-of-the-art unsupervised cross-view action classification accuracy on NTU RGB+D by a significant margin, on both RGB and depth images. We also demonstrate the effectiveness of transferring the representations learned on NTU RGB+D by obtaining the first unsupervised cross-view and cross-subject rank correlation results on the multi-view human movement quality dataset QMAR, marginally improving on the state-of-the-art supervised results for this dataset. Finally, we carry out ablation studies to examine the contributions of the different components of our proposed network.
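To make the two self-supervised training signals concrete, below is a minimal, purely illustrative Python sketch. The encoder `encode`, its weight matrix `W`, the embedding-space transform `T`, and the toy frames are all hypothetical stand-ins introduced here for illustration; the paper's actual network architecture, augmentations, and loss formulations are not specified in this abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (illustrative)

def encode(frame, W):
    """Stand-in for the learned pose encoder: flatten + linear projection.
    The real model is a deep network; this only makes the losses concrete."""
    return W @ frame.ravel()

def view_invariance_loss(z_a, z_b):
    """Simultaneous frames of the same pose, captured from two different
    viewpoints, should map to the same view-invariant embedding."""
    return float(np.sum((z_a - z_b) ** 2))

def equivariance_loss(z_aug, z, T):
    """A frame augmented by a known transform should map to the
    correspondingly transformed embedding T @ z (equivariance)."""
    return float(np.sum((z_aug - T @ z) ** 2))

# Toy data: two synchronized views of one pose, plus an augmented copy of view A.
frame_a = rng.random((32, 32))      # viewpoint A
frame_b = rng.random((32, 32))      # viewpoint B, same time instant
frame_a_aug = np.rot90(frame_a)     # augmented view-A frame
T = np.eye(D)                       # hypothetical embedding-space transform

W = rng.normal(scale=0.01, size=(D, 32 * 32))
z_a, z_b, z_aug = (encode(f, W) for f in (frame_a, frame_b, frame_a_aug))

loss = view_invariance_loss(z_a, z_b) + equivariance_loss(z_aug, z_a, T)
print(f"combined self-supervised loss: {loss:.4f}")
```

In this sketch the first term pulls embeddings of simultaneous multi-view frames together, while the second ties the embedding of an augmented same-view frame to a known transform of the original embedding, so the representation discards viewpoint while remaining sensitive to pose changes.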