Text-driven 3D scene generation techniques have made rapid progress in recent years. Their success is mainly attributed to using existing generative models to iteratively perform image warping and inpainting to generate 3D scenes. However, these methods depend heavily on the outputs of existing models, so errors in geometry and appearance accumulate and prevent them from being applied in diverse scenarios (e.g., outdoor and unreal scenes). To address this limitation, we generatively refine each newly generated local view by querying and aggregating global 3D information, and then progressively generate the 3D scene. Specifically, we employ a tri-plane-feature-based NeRF as a unified representation of the 3D scene to enforce global 3D consistency, and propose a generative refinement network that synthesizes new content of higher quality by exploiting the natural-image prior of a 2D diffusion model together with the global 3D information of the current scene. Extensive experiments demonstrate that, compared with previous methods, our approach supports a wide variety of scene types and arbitrary camera trajectories with improved visual quality and 3D consistency.
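To make the unified representation concrete, below is a minimal sketch of a tri-plane NeRF query in PyTorch: features live on three axis-aligned 2D planes, a 3D point is projected onto each plane, and the bilinearly sampled features are concatenated and decoded into density and color. The plane resolution, feature dimension, and decoder architecture here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TriPlaneNeRF(nn.Module):
    """Minimal tri-plane NeRF sketch: features are stored on three
    axis-aligned planes (xy, xz, yz); a small MLP decodes density
    and color from the aggregated plane features."""

    def __init__(self, res=256, dim=32):
        super().__init__()
        # Three learnable feature planes, each of shape (1, dim, res, res).
        self.planes = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(1, dim, res, res)) for _ in range(3)]
        )
        # Decoder maps concatenated plane features to (sigma, r, g, b).
        self.decoder = nn.Sequential(
            nn.Linear(3 * dim, 64), nn.ReLU(), nn.Linear(64, 4)
        )

    def forward(self, xyz):
        # xyz: (N, 3) query points normalized to [-1, 1].
        projections = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]
        feats = []
        for plane, uv in zip(self.planes, projections):
            # Bilinearly sample plane features at the projected 2D coords.
            grid = uv.view(1, -1, 1, 2)                    # (1, N, 1, 2)
            f = F.grid_sample(plane, grid, align_corners=True)
            feats.append(f.view(plane.shape[1], -1).t())   # (N, dim)
        out = self.decoder(torch.cat(feats, dim=-1))       # (N, 4)
        sigma, rgb = out[:, :1], torch.sigmoid(out[:, 1:])
        return sigma, rgb
```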
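The progressive warp-inpaint-refine loop can be outlined the same way. The sketch below is a hypothetical skeleton, not the paper's implementation: the four callables (`render`, `inpaint`, `refine`, `optimize`) stand in for the described components, and their signatures are assumptions made for illustration only.

```python
import torch

def progressive_generate(scene, trajectory, render, inpaint, refine, optimize):
    """Hypothetical outline of the progressive pipeline. Assumed interfaces:
      render(scene, pose)      -> (rgb, mask)  # mask marks unseen pixels
      inpaint(rgb, mask)       -> rgb          # 2D diffusion inpainting
      refine(rgb, scene, pose) -> rgb          # generative refinement using
                                               # queried global 3D features
      optimize(scene, rgb, pose) -> scene      # fit NeRF to the new view
    """
    for pose in trajectory:
        rgb, mask = render(scene, pose)   # warp: reproject the current scene
        rgb = inpaint(rgb, mask)          # fill disoccluded regions
        rgb = refine(rgb, scene, pose)    # correct accumulated errors
        scene = optimize(scene, rgb, pose)  # fuse the view into the scene
    return scene
```

The key design point this sketch reflects is that refinement happens before the new view is fused back, so errors from warping and inpainting are corrected against the global 3D state rather than accumulating across views.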