Accurately reconstructing complex human geometry, which varies widely with pose and clothing, from a single image is very challenging. Recent works based on the pixel-aligned implicit function (PIFu) have made substantial progress and achieved state-of-the-art fidelity in image-based 3D human digitization. However, training a PIFu relies heavily on expensive and limited 3D ground-truth data (i.e., synthetic data), which hinders its generalization to more diverse real-world images. In this work, we propose an end-to-end self-supervised network named SelfPIFu that exploits abundant and diverse in-the-wild images, yielding substantially improved reconstructions when tested on unconstrained in-the-wild images. At the core of SelfPIFu is depth-guided volume-/surface-aware signed distance field (SDF) learning, which enables self-supervised learning of a PIFu without access to ground-truth (GT) meshes. The whole framework consists of a normal estimator, a depth estimator, and an SDF-based PIFu, and it makes better use of additional depth GT during training. Extensive experiments demonstrate the effectiveness of our self-supervised framework and the superiority of using depth as input. On synthetic data, our method reaches an Intersection-over-Union (IoU) of 93.5%, 18% higher than PIFuHD. For in-the-wild images, we conduct user studies on the reconstructed results; our results are selected over 68% of the time when compared with other state-of-the-art methods.
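For readers unfamiliar with the IoU metric reported above, the following is a minimal sketch of how a volumetric IoU between two SDF-represented shapes could be computed. It is not the evaluation code used in this work: the helper names `sphere_sdf` and `volumetric_iou`, the grid resolution, the sampling cube, and the sign convention (negative inside the surface) are all assumptions made for illustration, and analytic spheres stand in for reconstructed and ground-truth bodies.

```python
import math


def sphere_sdf(x, y, z, center, radius):
    """Signed distance to a sphere (assumed convention: negative inside)."""
    return math.dist((x, y, z), center) - radius


def volumetric_iou(sdf_pred, sdf_gt, resolution=32, bound=1.0):
    """IoU of the interiors (sdf < 0) of two shapes.

    Both SDFs are sampled at voxel centers of a regular grid covering
    the cube [-bound, bound]^3. Returns 1.0 if both interiors are empty.
    """
    step = 2.0 * bound / resolution
    coords = [-bound + (i + 0.5) * step for i in range(resolution)]
    intersection = 0
    union = 0
    for x in coords:
        for y in coords:
            for z in coords:
                inside_pred = sdf_pred(x, y, z) < 0.0
                inside_gt = sdf_gt(x, y, z) < 0.0
                intersection += inside_pred and inside_gt
                union += inside_pred or inside_gt
    return intersection / union if union else 1.0


# Hypothetical stand-ins for a ground-truth shape and a slightly shifted prediction.
def gt_shape(x, y, z):
    return sphere_sdf(x, y, z, (0.0, 0.0, 0.0), 0.5)


def pred_shape(x, y, z):
    return sphere_sdf(x, y, z, (0.1, 0.0, 0.0), 0.5)


iou = volumetric_iou(pred_shape, gt_shape)
print(f"IoU = {iou:.3f}")
```

An identical prediction gives an IoU of exactly 1, and two shapes with no overlap give 0; the grid resolution controls how closely the sampled value approximates the true volume ratio.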