Large-scale diffusion generative models are greatly simplifying image, video and 3D asset creation from user-provided text prompts and images. However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study, we demonstrate that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation, Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images, without the need to modify the motion learning stage. Thus, our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks.
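The displacement total variation loss mentioned in (3) can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the hypothetical function name `displacement_tv_loss`, the 2D grid layout of displacement vectors, and the choice of squared differences between neighbours. The idea shown is only that a displacement field whose neighbouring samples disagree is penalized, which encourages spatially smooth motion.

```python
from typing import Sequence


def displacement_tv_loss(disp: Sequence[Sequence[Sequence[float]]]) -> float:
    """Mean squared difference between neighbouring displacement vectors.

    disp[r][c] is the displacement vector stored at row r, column c of a
    regular grid (an assumed layout, not taken from the paper).
    """
    total = 0.0
    count = 0
    rows, cols = len(disp), len(disp[0])
    for r in range(rows):
        for c in range(cols):
            # Difference with the neighbour one row down.
            if r + 1 < rows:
                total += sum((a - b) ** 2 for a, b in zip(disp[r + 1][c], disp[r][c]))
                count += 1
            # Difference with the neighbour one column to the right.
            if c + 1 < cols:
                total += sum((a - b) ** 2 for a, b in zip(disp[r][c + 1], disp[r][c]))
                count += 1
    # A grid with a single sample has no neighbour pairs to compare.
    return total / count if count else 0.0


# A uniform displacement (rigid shift) has no variation between neighbours.
uniform = [[[0.1, 0.0, 0.0] for _ in range(3)] for _ in range(3)]
# A checkerboard displacement changes at every neighbouring pair.
jittery = [[[0.1 * ((r + c) % 2), 0.0, 0.0] for c in range(3)] for r in range(3)]

print(displacement_tv_loss(uniform))
print(displacement_tv_loss(jittery))
```

Under these assumptions the uniform field incurs no penalty while the checkerboard field does, so minimizing the term alongside the video diffusion guidance would favour coherent motion over per-point jitter.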