In this paper, we present a diffusion-model-based framework for animating a person from a single image, given a target 3D motion sequence. Our approach has two core components: (a) learning priors about the invisible parts of the human body and clothing, and (b) rendering novel body poses with proper clothing and texture. For the first component, we learn an in-filling diffusion model that hallucinates the unseen parts of a person given a single image. We train this model in texture map space, which makes it more sample-efficient because that space is invariant to pose and viewpoint. For the second component, we develop a diffusion-based rendering pipeline controlled by 3D human poses. It produces realistic renderings of the person in novel poses, including clothing, hair, and plausible in-filling of unseen regions. This disentangled approach allows our method to generate a sequence of images that is faithful to the target motion in terms of 3D pose and to the input image in terms of visual similarity. In addition, the 3D control allows the person to be rendered along various synthetic camera trajectories. Our experiments show that, compared to prior methods, our method is more resilient when generating prolonged motions and varied, challenging, and complex poses. Please see our website for more details: https://boyiliee.github.io/3DHM.github.io/.