The appearance of a clothed human is driven not only by pose but also by its temporal context, i.e., motion. However, existing monocular human modeling methods have largely neglected this context, and their neural networks often struggle to learn from a video of a person with large dynamics because of motion ambiguity: even for the same pose, many geometric configurations of the clothes are possible, depending on the motion context. In this paper, we introduce a method for high-quality modeling of clothed 3D human avatars from a video of a person with dynamic movements. The main challenge is the lack of 3D ground-truth geometry and its temporal correspondences. We address this challenge with a novel compositional human modeling framework that takes advantage of both explicit and implicit human modeling. For explicit modeling, a neural network learns to generate point-wise shape residuals and appearance features of a 3D body model by comparing its 2D renderings with the original images. This explicit model allows discriminative 3D motion features to be reconstructed from UV space by encoding their temporal correspondences. For implicit modeling, an implicit network combines the appearance and 3D motion features to decode high-fidelity clothed 3D human avatars with motion-dependent geometry and texture. Experiments show that our method can generate a wide variety of secondary motion in a physically plausible way.