In this paper, we present a diffusion model-based framework for animating people from a single image given a target 3D motion sequence. Our approach has two core components: (a) learning priors about the invisible parts of the human body and clothing, and (b) rendering novel body poses with proper clothing and texture. For the first component, we learn an in-filling diffusion model that hallucinates the unseen parts of a person from a single image. We train this model in texture-map space, which makes it more sample-efficient because the representation is invariant to pose and viewpoint. For the second, we develop a diffusion-based rendering pipeline controlled by 3D human poses. It produces realistic renderings of novel poses of the person, including clothing, hair, and plausible in-filling of unseen regions. This disentangled approach allows our method to generate a sequence of images that is faithful to the target motion in 3D pose and to the input image in visual appearance. Moreover, the 3D control allows the person to be rendered along various synthetic camera trajectories. Our experiments show that, compared to prior methods, our method is robust in generating prolonged motions and handling varied, challenging, and complex poses. Please check our website for more details: https://boyiliee.github.io/3DHM.github.io/.
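The disentangled two-stage pipeline described above can be sketched at a high level as follows. This is a minimal illustrative sketch only: every function and variable name here is a hypothetical placeholder, and both diffusion models are stubbed out with trivial NumPy operations (mean-fill and a blank frame) purely to show the control flow of "in-fill the texture map once, then render each pose."

```python
import numpy as np

def infill_texture_map(partial_texture: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stage 1 (stub): hallucinate unseen texels of the UV texture map.
    In the actual method this would be an in-filling diffusion model operating
    in texture-map space (pose- and viewpoint-invariant); here we simply fill
    masked-out texels with the mean color of the visible ones."""
    filled = partial_texture.copy()
    visible_mean = partial_texture[mask].mean(axis=0)  # mean over visible texels
    filled[~mask] = visible_mean
    return filled

def render_pose(full_texture: np.ndarray, pose_params: np.ndarray) -> np.ndarray:
    """Stage 2 (stub): pose-conditioned diffusion rendering.
    The real pipeline is controlled by a 3D human pose; here we just emit a
    dummy frame with the same spatial size as the texture map."""
    h, w, c = full_texture.shape
    return np.zeros((h, w, c), dtype=full_texture.dtype)

def animate(image_texture, visibility_mask, motion_sequence):
    """Full pipeline: in-fill the texture once, then render every target pose."""
    texture = infill_texture_map(image_texture, visibility_mask)
    return [render_pose(texture, pose) for pose in motion_sequence]

# Toy example: 8x8 RGB texture with only the left half visible, 3-frame motion.
tex = np.random.rand(8, 8, 3)
mask = np.zeros((8, 8), dtype=bool)
mask[:, :4] = True
frames = animate(tex, mask, [np.zeros(72)] * 3)  # 72-dim pose is an assumed, SMPL-style parameterization
```

Note the design point this sketch captures: because in-filling happens once in texture space, the per-frame rendering stage only consumes the completed texture and the current pose, which is what keeps long motion sequences consistent with the input identity.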