Large-scale diffusion generative models are greatly simplifying image, video and 3D asset creation from user-provided text prompts and images. However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study, we demonstrate that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation, Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images, without the need to modify the motion learning stage. Thus, our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks.
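Components (2) and (3) above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function names `warp_to_canonical` and `displacement_tv_loss` are hypothetical, the deformation is modeled as an additive per-point displacement into the static (canonical) asset, and the total variation term is the common mean L1 difference between neighboring displacement vectors on a 2D grid. The multi-resolution feature grid and the diffusion guidance are not modeled here.

```python
def warp_to_canonical(point, displacement):
    """Assumed deformation model: a point observed at time t is mapped into
    the static canonical space by adding its displacement, x + d(x, t)."""
    return tuple(p + d for p, d in zip(point, displacement))


def displacement_tv_loss(disp):
    """Mean L1 difference between neighboring displacement vectors.

    disp: H x W grid (list of rows); each cell is a (dx, dy, dz) tuple.
    This is one common total variation form, not necessarily the paper's.
    """
    h, w = len(disp), len(disp[0])
    total, count = 0.0, 0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:  # compare with the neighbor one row below
                total += sum(abs(a - b) for a, b in zip(disp[i + 1][j], disp[i][j]))
                count += 1
            if j + 1 < w:  # compare with the neighbor one column to the right
                total += sum(abs(a - b) for a, b in zip(disp[i][j + 1], disp[i][j]))
                count += 1
    return total / count if count else 0.0


# A uniform displacement map (rigid shift) versus a checkerboard-like noisy one.
smooth = [[(0.1, 0.0, 0.0)] * 3 for _ in range(3)]
noisy = [[(0.1 * ((i + j) % 2), 0.0, 0.0) for j in range(3)] for i in range(3)]

smooth_loss = displacement_tv_loss(smooth)
noisy_loss = displacement_tv_loss(noisy)
canonical_point = warp_to_canonical((1.0, 2.0, 3.0), (0.5, 0.0, -1.0))
```

Under these assumptions, the uniform map incurs no penalty while the noisy map does, which is the sense in which such a term favors spatially smooth motion. Because the static asset is only ever queried through the warp, its appearance stays untouched while the displacement is being learned.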