Recent unsupervised methods for monocular 3D pose estimation aim to reduce dependence on scarce annotated 3D data, but most are formulated solely in 2D space and overlook the inherent depth ambiguity. Because 3D-to-2D projection discards information, a single 2D observation can correspond to multiple depths, only some of which are plausible for human structure. To tackle depth ambiguity, we propose a novel unsupervised framework featuring a multi-hypothesis detector and multiple tailored pretext tasks. The detector extracts multiple hypotheses from a heatmap within a local window, which handles the multi-solution problem. The pretext tasks harness 3D human priors from the SMPL model to regularize the solution space of pose estimation, aligning it with the empirical distribution of 3D human structures. This regularization is achieved partly through a GCN-based discriminator within a discriminative learning scheme, and is complemented by synthetic images obtained through rendering, to keep estimations plausible. As a result, our approach achieves state-of-the-art unsupervised 3D pose estimation performance on various human datasets. Further evaluations on data scale-up and on one animal dataset highlight its generalization capability. Code will be available at https://github.com/Charrrrrlie/X-as-Supervision.
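To make the idea of drawing several hypotheses from a heatmap within a local window concrete, the following is a minimal sketch and not the released implementation. It assumes a single-joint 2D heatmap stored as a NumPy array; the function name `local_window_hypotheses`, the parameters `window` and `k`, and the choice of top-k selection around the global peak are all illustrative assumptions, since the abstract does not specify how hypotheses are selected or weighted.

```python
import numpy as np


def local_window_hypotheses(heatmap, window=5, k=3):
    """Return up to k (row, col, weight) hypotheses near the heatmap peak.

    Hypothetical helper: top-k selection inside a square window centred
    on the global maximum is an assumption made for illustration.
    """
    h, w = heatmap.shape
    r0, c0 = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    half = window // 2

    # Clip the window to the heatmap bounds.
    r_lo, r_hi = max(r0 - half, 0), min(r0 + half + 1, h)
    c_lo, c_hi = max(c0 - half, 0), min(c0 + half + 1, w)
    patch = heatmap[r_lo:r_hi, c_lo:c_hi]

    # Flat indices of the k largest responses in the patch, strongest first.
    k = min(k, patch.size)
    flat = np.argsort(patch, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, patch.shape)
    scores = patch[rows, cols]

    # Normalise scores into weights; fall back to uniform if all are zero.
    total = scores.sum()
    weights = scores / total if total > 0 else np.full(k, 1.0 / k)

    return [
        (int(r + r_lo), int(c + c_lo), float(wt))
        for r, c, wt in zip(rows, cols, weights)
    ]
```

Each returned tuple is one candidate location with a normalized confidence. Responses outside the window are ignored even if they are strong, which is what keeps the hypotheses local to the dominant peak in this sketch.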