We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model. The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment. MAV3D does not require any 3D or 4D data and the T2V model is trained only on Text-Image pairs and unlabeled videos. We demonstrate the effectiveness of our approach using comprehensive quantitative and qualitative experiments and show an improvement over previously established internal baselines. To the best of our knowledge, our method is the first to generate 3D dynamic scenes given a text description.
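The optimization described above, in which a scene representation is updated by querying a diffusion model, can be illustrated with a toy score-distillation-style loop. This is a minimal sketch under strong assumptions, not the MAV3D implementation: the abstract does not specify the loss, so score distillation is assumed here as one common way to do this. The names `toy_denoiser`, `render`, and `distill` are hypothetical. The "scene" is a small video tensor rather than a 4D NeRF, rendering is the identity, and the T2V model is replaced by a closed-form noise predictor for a distribution concentrated on a single target video.

```python
import numpy as np


def toy_denoiser(noisy_video, alpha, sigma, target_video):
    """Hypothetical stand-in for a frozen T2V model's noise prediction.

    Assumes the text-conditioned video distribution is a single point
    (target_video), for which the ideal noise prediction has this closed form.
    """
    return (noisy_video - alpha * target_video) / sigma


def render(scene_params):
    """Hypothetical stand-in for rendering the dynamic scene from a camera.

    The scene parameters are the video itself, so the Jacobian of the
    renderer with respect to the parameters is the identity.
    """
    return scene_params


def distill(target_video, steps=300, lr=0.1, seed=0):
    """Optimize scene parameters by repeatedly querying the toy denoiser."""
    rng = np.random.default_rng(seed)
    scene_params = rng.normal(size=target_video.shape)  # random initial scene

    for _ in range(steps):
        video = render(scene_params)

        # Sample a noise level and noise the rendered video.
        sigma = rng.uniform(0.3, 0.9)
        alpha = np.sqrt(1.0 - sigma ** 2)  # variance-preserving schedule (assumed)
        eps = rng.normal(size=video.shape)
        noisy = alpha * video + sigma * eps

        # Query the (toy) diffusion model for its noise estimate.
        eps_hat = toy_denoiser(noisy, alpha, sigma, target_video)

        # Score-distillation-style gradient: weight * (predicted - injected noise).
        # With an identity renderer this is already the parameter gradient.
        grad = sigma * (eps_hat - eps)
        scene_params = scene_params - lr * grad

    return scene_params


if __name__ == "__main__":
    frames, height, width = 4, 8, 8
    target = np.linspace(0.0, 1.0, frames * height * width).reshape(
        frames, height, width
    )
    result = distill(target)
    print("max abs error:", float(np.max(np.abs(result - target))))
```

In this toy setting the gradient reduces to `alpha * (video - target_video)`, so each step pulls the parameters toward the video that the stand-in model prefers. A real system would replace `render` with a differentiable 4D NeRF renderer over sampled cameras and times, and `toy_denoiser` with a pretrained T2V diffusion model.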