Recent progress in human shape learning shows that neural implicit models are effective at generating 3D human surfaces from a limited number of views, and even from a single RGB image. However, existing monocular approaches still struggle to recover fine geometric details such as the face, hands, or cloth wrinkles. They are also prone to depth ambiguities that result in distorted geometries along the camera's optical axis. In this paper, we explore the benefits of incorporating depth observations in the reconstruction process by introducing ANIM, a novel method that reconstructs arbitrary 3D human shapes from single-view RGB-D images with an unprecedented level of accuracy. Our model learns geometric details from both multi-resolution pixel-aligned and voxel-aligned features to leverage depth information and capture spatial relationships, mitigating depth ambiguities. We further enhance the quality of the reconstructed shape by introducing a depth-supervision strategy, which improves the accuracy of the signed distance field estimation for points that lie on the reconstructed surface. Experiments demonstrate that ANIM outperforms state-of-the-art works that use RGB, surface normals, point clouds, or RGB-D data as input. In addition, we introduce ANIM-Real, a new multi-modal dataset comprising high-quality scans paired with consumer-grade RGB-D camera captures, along with our protocol to fine-tune ANIM, enabling high-quality reconstruction from real-world human capture.
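To make the depth-supervision idea concrete, the following is a minimal sketch and not the ANIM implementation. It assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and it assumes the supervision takes the form of a mean absolute signed distance at the 3D points observed by the depth sensor, a loss form the abstract does not specify. The function names `backproject_depth`, `depth_supervision_loss`, and `sphere_sdf` are illustrative, and an analytic sphere stands in for the learned signed distance field.

```python
import math

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to camera-space 3D points.

    Assumes a pinhole camera; pixels with depth <= 0 are treated as invalid.
    """
    points = []
    for v, row in enumerate(depth):        # v: pixel row
        for u, z in enumerate(row):        # u: pixel column
            if z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

def sphere_sdf(p, center=(0.0, 0.0, 2.0), radius=0.5):
    """Toy stand-in for a learned SDF: signed distance to a sphere."""
    return math.dist(p, center) - radius

def depth_supervision_loss(sdf_fn, points):
    """Mean |SDF| at depth-observed points (assumed loss form).

    The value is zero when the predicted surface passes through every point.
    """
    if not points:
        return 0.0
    return sum(abs(sdf_fn(p)) for p in points) / len(points)

# Toy 3x3 depth map with a single valid pixel at the principal point.
depth = [[0.0, 0.0, 0.0],
         [0.0, 1.5, 0.0],
         [0.0, 0.0, 0.0]]
pts = backproject_depth(depth, fx=100.0, fy=100.0, cx=1.0, cy=1.0)
loss = depth_supervision_loss(sphere_sdf, pts)
```

In this toy setup the only valid pixel back-projects to a point on the front of the sphere, so the loss vanishes. A surface that disagreed with the sensor depth would yield a positive value, which is the signal such a term would provide during training.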