Leveraging pretrained 2D diffusion models and score distillation sampling (SDS), recent methods have shown promising results for text-to-3D avatar generation. However, generating high-quality 3D avatars capable of expressive animation remains challenging. In this work, we present DreamWaltz-G, a novel learning framework for animatable 3D avatar generation from text. The core of this framework lies in Skeleton-guided Score Distillation and Hybrid 3D Gaussian Avatar representation. Specifically, the proposed skeleton-guided score distillation integrates skeleton controls from 3D human templates into 2D diffusion models, enhancing the consistency of SDS supervision in terms of view and human pose. This facilitates the generation of high-quality avatars, mitigating issues such as multiple faces, extra limbs, and blurring. The proposed hybrid 3D Gaussian avatar representation builds on the efficient 3D Gaussians, combining neural implicit fields and parameterized 3D meshes to enable real-time rendering, stable SDS optimization, and expressive animation. Extensive experiments demonstrate that DreamWaltz-G is highly effective in generating and animating 3D avatars, outperforming existing methods in both visual quality and animation expressiveness. Our framework further supports diverse applications, including human video reenactment and multi-subject scene composition.
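As a rough illustration of the SDS supervision mentioned above, the sketch below computes the standard single-sample SDS gradient, w(t) * (eps_hat - eps), for a rendered image. It is a toy under stated assumptions, not the DreamWaltz-G implementation: `sds_gradient`, `toy_predict_noise`, and `skeleton_map` are hypothetical names, the "diffusion model" is a closed-form stand-in that treats the skeleton map as the clean image, and the image itself plays the role of the 3D representation (a real pipeline would backpropagate this gradient through a differentiable renderer to the avatar parameters).

```python
import numpy as np

def sds_gradient(rendered, predict_noise, text_emb, skeleton_map, alpha_bar, weight, rng):
    """Single-sample SDS gradient with respect to the rendered image."""
    eps = rng.standard_normal(rendered.shape)
    # Forward diffusion: x_t = sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps
    x_t = np.sqrt(alpha_bar) * rendered + np.sqrt(1.0 - alpha_bar) * eps
    # Noise prediction conditioned on the text and (by assumption) a skeleton map
    eps_hat = predict_noise(x_t, alpha_bar, text_emb, skeleton_map)
    # SDS omits the denoiser Jacobian and uses the noise residual directly
    return weight * (eps_hat - eps)

def toy_predict_noise(x_t, alpha_bar, text_emb, skeleton_map):
    # Hypothetical stand-in for a skeleton-conditioned diffusion model:
    # it assumes the clean image equals skeleton_map and solves the
    # forward-diffusion equation for the noise. text_emb is ignored here.
    return (x_t - np.sqrt(alpha_bar) * skeleton_map) / np.sqrt(1.0 - alpha_bar)

rng = np.random.default_rng(0)
rendered = rng.random((8, 8))        # toy "render" of the avatar
skeleton_map = np.zeros((8, 8))
skeleton_map[2:6, 4] = 1.0           # toy skeleton condition (a vertical "bone")

# Gradient descent on the image using the SDS gradient
for _ in range(200):
    rendered -= 0.1 * sds_gradient(rendered, toy_predict_noise, None,
                                   skeleton_map, alpha_bar=0.5, weight=1.0, rng=rng)
```

In this toy setting the sampled noise cancels exactly, so each step pulls the image toward the skeleton-consistent target. With a real diffusion model the residual is noisy and only consistent in expectation, which is why conditioning that keeps the supervision consistent across views and poses matters.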