Reconstructing complete and animatable 3D human avatars from monocular videos remains challenging, particularly under severe occlusions. While 3D Gaussian Splatting has enabled photorealistic human rendering, existing methods struggle with incomplete observations, often producing corrupted geometry and temporal inconsistencies. We present InpaintHuman, a novel method for generating high-fidelity, complete, and animatable avatars from occluded monocular videos. Our approach introduces two key innovations: (i) a multi-scale UV-parameterized representation with hierarchical coarse-to-fine feature interpolation, enabling robust reconstruction of occluded regions while preserving geometric details; and (ii) an identity-preserving diffusion inpainting module that integrates textual inversion with semantic-conditioned guidance for subject-specific, temporally coherent completion. Unlike SDS-based methods, our approach employs direct pixel-level supervision to ensure identity fidelity. Experiments on synthetic benchmarks (PeopleSnapshot, ZJU-MoCap) and real-world scenarios (OcMotion) demonstrate competitive performance with consistent improvements in reconstruction quality across diverse poses and viewpoints.
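To make the multi-scale UV-parameterized representation concrete, the sketch below shows one plausible form of hierarchical coarse-to-fine feature interpolation over learnable UV feature maps. It is a minimal illustration only, not the paper's implementation: the class name, grid resolutions, feature dimension, and the choice of summing features across scales are all assumptions made for the example.

```python
# Illustrative sketch only; the abstract does not specify this implementation.
# Multi-scale UV feature maps with coarse-to-fine bilinear interpolation,
# assuming PyTorch and hypothetical resolutions / feature width.
import torch
import torch.nn.functional as F


class MultiScaleUVFeatures(torch.nn.Module):
    """Learnable UV feature maps at several resolutions; queries return the
    aggregated bilinearly interpolated features across all scales."""

    def __init__(self, resolutions=(32, 64, 128), feat_dim=16):
        super().__init__()
        # One learnable feature map per scale, shaped (1, C, H, W).
        self.grids = torch.nn.ParameterList(
            torch.nn.Parameter(0.01 * torch.randn(1, feat_dim, r, r))
            for r in resolutions
        )

    def forward(self, uv):
        # uv: (N, 2) coordinates in [0, 1]; grid_sample expects [-1, 1].
        grid = (uv * 2.0 - 1.0).view(1, -1, 1, 2)
        feats = []
        for g in self.grids:
            sampled = F.grid_sample(g, grid, mode="bilinear", align_corners=True)
            feats.append(sampled.view(g.shape[1], -1).t())  # (N, C)
        # Coarse-to-fine aggregation: sum per-scale features (an assumption;
        # concatenation or learned fusion would work equally well here).
        return torch.stack(feats, dim=0).sum(dim=0)


if __name__ == "__main__":
    model = MultiScaleUVFeatures()
    uv_query = torch.rand(1024, 2)   # e.g. UV coordinates of Gaussian anchors
    features = model(uv_query)       # (1024, 16) per-point features
    print(features.shape)
```

In this reading, coarse grids give robust features for occluded or sparsely observed UV regions, while finer grids add geometric detail where observations are dense, which matches the coarse-to-fine motivation stated above.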