Creating realistic avatars from a single RGB image is an attractive yet challenging problem. Due to its ill-posed nature, recent works leverage powerful priors from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot provide multi-view shape priors with guaranteed 3D consistency. We propose Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion. Our key insight is that 2D multi-view diffusion models and 3D reconstruction models provide complementary information to each other, and that by coupling them tightly we can fully leverage the potential of both. We introduce a novel image-conditioned generative 3D Gaussian Splats reconstruction model that leverages the priors from 2D multi-view diffusion models and provides an explicit 3D representation, which in turn guides the 2D reverse sampling process toward better 3D consistency. Experiments show that our proposed framework outperforms state-of-the-art methods and enables the creation of realistic avatars from a single RGB image, achieving high fidelity in both geometry and appearance. Extensive ablations also validate the efficacy of our design: (1) conditioning the generative 3D reconstruction on multi-view 2D priors, and (2) refining the consistency of the sampling trajectory via the explicit 3D representation. Our code and models will be released at https://yuxuan-xue.com/human-3diffusion.
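The coupling described above can be sketched as a reverse-sampling loop in which, at each denoising step, the 2D multi-view model's prediction is replaced by views re-rendered from an explicit shared 3D state. This is a minimal toy illustration of that control flow only: `denoise_2d` and `reconstruct_and_render_3d` are hypothetical stand-ins (simple linear maps on flat vectors), not the paper's actual networks.

```python
import numpy as np

def denoise_2d(x_t, t):
    # Toy stand-in for the 2D multi-view diffusion model's x0 prediction:
    # shrink the noisy sample toward zero as t decreases.
    return x_t * (1.0 - t)

def reconstruct_and_render_3d(views):
    # Toy stand-in for the generative 3D Gaussian Splats reconstruction:
    # average the views and broadcast the result back to every view,
    # mimicking re-rendering from one shared explicit 3D representation.
    return np.tile(views.mean(axis=0, keepdims=True), (views.shape[0], 1))

def coupled_reverse_sampling(x_T, timesteps):
    x_t = x_T
    for t in timesteps:
        x0_2d = denoise_2d(x_t, t)                 # 2D multi-view prior
        x0_3d = reconstruct_and_render_3d(x0_2d)   # refine via explicit 3D
        # DDIM-style deterministic step toward the 3D-consistent prediction
        x_t = x0_3d + t * (x_t - x0_3d)
    return x_t

# 4 views, 8 features each; decreasing noise levels down to t = 0
views = coupled_reverse_sampling(np.random.randn(4, 8), [0.8, 0.5, 0.2, 0.0])
```

Because each step pulls the trajectory toward views re-rendered from a single 3D state, the final views agree exactly in this toy setup, which is the mechanism behind the consistency refinement of the sampling trajectory.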