For reconstructing high-fidelity 3D human models from monocular videos, it is crucial to maintain consistent large-scale body shapes along with finely matched subtle wrinkles. This paper explores the observation that per-frame rendering results can be factorized into a pose-independent component and a pose-dependent counterpart, which facilitates frame consistency. Pose-adaptive textures can be further improved by restricting the frequency bands of these two components: pose-independent outputs are expected to be low-frequency, while high-frequency information is linked to pose-dependent factors. Using a dual-branch network whose branches operate on distinct frequency components, we coherently preserve both the coarse body contours across the entire input video and the fine-grained, time-varying texture features. The first branch takes coordinates in canonical space as input, while the second branch additionally takes the features output by the first branch and the pose information of each frame. Our network integrates the information predicted by both branches and uses volume rendering to generate photo-realistic images of 3D humans. Experiments demonstrate that our network surpasses state-of-the-art methods based on neural radiance fields (NeRF) in preserving high-frequency details and ensuring consistent body contours.
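The dual-branch factorization described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The layer sizes, the frequency counts (`LOW_FREQS`, `HIGH_FREQS`), the feature width (`FEAT_DIM`), the pose vector length, and the additive way the two branches are merged are all assumptions made for illustration, and the weights are random rather than trained. Only the overall data flow follows the abstract: a low-frequency branch on canonical coordinates, a high-frequency branch that also sees the first branch's features and the per-frame pose, then standard NeRF-style volume rendering.

```python
import numpy as np

# Hypothetical hyperparameters; the abstract does not specify any of them.
LOW_FREQS, HIGH_FREQS, FEAT_DIM = 2, 8, 16

def positional_encoding(x, num_freqs):
    """NeRF-style encoding [x, sin(2^k x), cos(2^k x)] for k = 0..num_freqs-1."""
    freqs = 2.0 ** np.arange(num_freqs)
    scaled = x[..., None] * freqs                                  # (..., D, F)
    waves = np.concatenate([np.sin(scaled), np.cos(scaled)], -1)   # (..., D, 2F)
    return np.concatenate([x, waves.reshape(*x.shape[:-1], -1)], -1)

def init_mlp(rng, sizes):
    """Random (untrained) weights for a fully connected network."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(layers, x):
    """ReLU MLP with a linear final layer."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def init_params(rng, pose_dim, hidden=64):
    in1 = 3 + 6 * LOW_FREQS                          # encoded canonical xyz
    in2 = 3 + 6 * HIGH_FREQS + FEAT_DIM + pose_dim   # xyz + features + pose
    return {"indep": init_mlp(rng, [in1, hidden, FEAT_DIM + 4]),
            "dep": init_mlp(rng, [in2, hidden, 4])}

def dual_branch_field(params, x_canonical, pose):
    """Return (density, colour, pose-independent raw output) per sample."""
    # Branch 1: pose-independent, restricted to a low frequency band.
    h1 = mlp(params["indep"], positional_encoding(x_canonical, LOW_FREQS))
    feat, base = h1[..., :FEAT_DIM], h1[..., FEAT_DIM:]
    # Branch 2: pose-dependent, high frequency band, conditioned on branch 1.
    pose_b = np.broadcast_to(pose, x_canonical.shape[:-1] + pose.shape)
    inp2 = np.concatenate(
        [positional_encoding(x_canonical, HIGH_FREQS), feat, pose_b], -1)
    residual = mlp(params["dep"], inp2)
    # Assumed merge rule: add the raw outputs, then apply activations.
    raw = base + residual
    sigma = np.logaddexp(0.0, raw[..., :1])          # softplus keeps density >= 0
    rgb = 1.0 / (1.0 + np.exp(-raw[..., 1:]))        # sigmoid keeps colour in [0, 1]
    return sigma, rgb, base

def volume_render(sigma, rgb, deltas):
    """Standard NeRF quadrature along each ray (the samples axis is second to last)."""
    alpha = 1.0 - np.exp(-sigma[..., 0] * deltas)    # opacity per sample
    trans = np.cumprod(
        np.concatenate([np.ones_like(alpha[..., :1]), 1.0 - alpha[..., :-1]], -1), -1)
    weights = alpha * trans                          # contribution of each sample
    return (weights[..., None] * rgb).sum(-2), weights

# Usage: 4 rays, 8 samples per ray, a made-up 6-dimensional pose vector.
rng = np.random.default_rng(0)
params = init_params(rng, pose_dim=6)
points = rng.uniform(-1.0, 1.0, (4, 8, 3))           # samples already in canonical space
sigma, rgb, _ = dual_branch_field(params, points, rng.normal(size=6))
pixels, _ = volume_render(sigma, rgb, np.full((4, 8), 0.1))
print(pixels.shape)
```

In this sketch the first branch never sees the pose, so its output is identical for every frame, while the second branch only contributes a pose-conditioned residual. That is one simple way to realize the pose-independent and pose-dependent split; the paper may combine the branches differently.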