We introduce RealmDreamer, a technique for generating general forward-facing 3D scenes from text descriptions. Our method optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by utilizing state-of-the-art text-to-image generators, lifting their samples into 3D, and computing an occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model, conditioning it on samples from the inpainting model to obtain rich geometric cues. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique requires no video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.