Human reconstruction and synthesis from monocular RGB videos is a challenging problem due to clothing, occlusion, texture discontinuities and sharpness, and frame-specific pose changes. Many methods employ deferred rendering, NeRFs, and implicit methods to represent clothed humans, on the premise that mesh-based representations cannot capture complex clothing and textures from RGB, silhouettes, and keypoints alone. We provide a counter viewpoint to this fundamental premise by optimizing a SMPL+D mesh and an efficient, multi-resolution texture representation using only RGB images, binary silhouettes, and sparse 2D keypoints. Experimental results demonstrate that our approach captures geometric details better than visual hull, mesh-based methods. We show competitive novel view synthesis and improvements in novel pose synthesis compared to NeRF-based methods, which introduce noticeable, unwanted artifacts. By restricting the solution space to the SMPL+D model combined with differentiable rendering, we obtain dramatic speedups in compute, training times (up to 24x), and inference times (up to 192x). Our method can therefore be used as is or as a fast initialization for NeRF-based methods.