Aided by text-to-image and text-to-video diffusion models, existing 4D content creation pipelines use score distillation sampling to optimize an entire dynamic 3D scene. However, because these pipelines generate 4D content from text or image inputs, they demand significant time and effort for trial-and-error prompt engineering. This work introduces 4DGen, a novel, holistic framework for grounded 4D content creation that decomposes the 4D generation task into multiple stages. We identify static 3D assets and monocular video sequences as the key components for constructing 4D content. Our pipeline supports conditional 4D generation: users specify the geometry (3D assets) and the motion (monocular videos), which gives them superior control over content creation. Furthermore, we build our 4D representation on dynamic 3D Gaussians, which permit efficient, high-resolution supervision through rendering during training and thereby facilitate high-quality 4D generation. Additionally, we employ spatial-temporal pseudo labels on anchor frames, along with seamless consistency priors implemented through 3D-aware score distillation sampling and smoothness regularizations. Compared to existing baselines, our approach yields competitive results in faithfully reconstructing the input signals and in realistically inferring renderings at novel viewpoints and timesteps. Most importantly, our method supports grounded generation, offering users enhanced control, a feature that is difficult to achieve with previous methods. Project page: https://vita-group.github.io/4DGen/
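To make the idea of a smoothness regularization on dynamic 3D Gaussians more concrete, the following is a minimal sketch, not the 4DGen implementation. The abstract does not state the form of the regularizer, so the choice here (a mean squared second-order finite difference over the Gaussian centers) is an assumption, as are the function name `temporal_smoothness` and the `trajectories[t][i]` layout.

```python
from typing import List, Sequence

Point = Sequence[float]  # hypothetical layout: (x, y, z) center of one Gaussian


def temporal_smoothness(trajectories: List[List[Point]]) -> float:
    """Mean squared second-order finite difference over time.

    trajectories[t][i] is the center of Gaussian i at timestep t.
    This penalizes acceleration, so constant-velocity motion costs zero.
    Assumed form for illustration only; the source does not specify it.
    """
    num_steps = len(trajectories)
    if num_steps < 3:
        return 0.0  # a second difference needs at least three timesteps

    total = 0.0
    count = 0
    for t in range(1, num_steps - 1):
        frames = zip(trajectories[t - 1], trajectories[t], trajectories[t + 1])
        for prev, cur, nxt in frames:
            for a, b, c in zip(prev, cur, nxt):
                accel = a - 2.0 * b + c  # discrete second derivative
                total += accel * accel
                count += 1
    return total / count if count else 0.0


# One Gaussian moving at constant velocity versus one that jitters back.
linear = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
jitter = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
smooth_cost = temporal_smoothness(linear)
jitter_cost = temporal_smoothness(jitter)
```

In a training loop, a term of this kind would typically be added to the rendering and score distillation losses with a small weight, and computed with a differentiable tensor library rather than plain Python lists.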