Automatic estimation of 3D human pose from monocular RGB images is a challenging and unsolved problem in computer vision. Supervised approaches rely heavily on laborious annotations and generalize poorly due to the limited diversity of 3D pose datasets. To address these challenges, we propose a unified framework that leverages masks as supervision for unsupervised 3D pose estimation. Building on general unsupervised segmentation algorithms, the proposed model employs skeleton and physique representations that exploit accurate pose information in a coarse-to-fine manner. Compared with previous unsupervised approaches, we organize the human skeleton in a fully unsupervised way, which enables the processing of annotation-free data and provides ready-to-use estimation results. Comprehensive experiments demonstrate state-of-the-art pose estimation performance on the Human3.6M and MPI-INF-3DHP datasets. Further experiments on in-the-wild datasets also show that our model can exploit additional unlabeled data to improve performance. Code will be available at https://github.com/Charrrrrlie/Mask-as-Supervision.