Recent advances in generative diffusion models have made it possible to generate 3D assets from a single input image or a text prompt, a capability that was previously infeasible. In this work, we aim to enhance the quality and functionality of these models for the task of creating controllable, photorealistic human avatars. We achieve this by integrating a 3D morphable model into the state-of-the-art multi-view-consistent diffusion approach. We demonstrate that accurately conditioning the generative pipeline on the articulated 3D model improves the baseline model's performance on novel view synthesis from a single image. More importantly, this integration enables seamless and accurate incorporation of facial expression and body pose control into the generation process. To the best of our knowledge, our proposed framework is the first diffusion model to enable the creation of fully 3D-consistent, animatable, and photorealistic human avatars from a single image of an unseen subject. Extensive quantitative and qualitative evaluations demonstrate the advantages of our approach over existing state-of-the-art avatar creation models on both novel view and novel expression synthesis tasks. The code for our project is publicly available.