In this work, we present SceneDreamer, an unconditional generative model for unbounded 3D scenes, which synthesizes large-scale 3D landscapes from random noise. Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations. At the core of SceneDreamer is a principled learning paradigm comprising 1) an efficient yet expressive 3D scene representation, 2) a generative scene parameterization, and 3) an effective renderer that can leverage knowledge from 2D images. Our approach begins with an efficient bird's-eye-view (BEV) representation generated from simplex noise, which includes a height field for surface elevation and a semantic field for detailed scene semantics. This BEV scene representation enables 1) representing a 3D scene with quadratic complexity, 2) disentangled geometry and semantics, and 3) efficient training. Moreover, we propose a novel generative neural hash grid to parameterize the latent space based on 3D positions and scene semantics, aiming to encode generalizable features across various scenes. Lastly, a neural volumetric renderer, learned from 2D image collections through adversarial training, is employed to produce photorealistic images. Extensive experiments demonstrate the effectiveness of SceneDreamer and its superiority over state-of-the-art methods in generating vivid yet diverse unbounded 3D worlds.
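To make the BEV representation more concrete, the following is a minimal sketch in Python, not SceneDreamer's actual implementation. It assumes smoothed value noise as a simple stand-in for simplex noise, and it assumes that semantic labels can be derived by thresholding height; the function names, label names, and threshold values are all hypothetical and chosen only for illustration. The last helper compares the number of cells stored by two 2D maps against a dense voxel grid of the same resolution, which is the sense in which the BEV representation has quadratic rather than cubic complexity.

```python
import random

# Hypothetical label set; the source does not list its semantic classes.
LABELS = ("water", "sand", "grass", "rock")


def value_noise_grid(size, cells, seed):
    """Smooth 2D noise in [0, 1] on a size x size grid.

    Bilinearly interpolates a random lattice with smoothstep easing.
    This is a stand-in for simplex noise, not simplex noise itself.
    """
    rng = random.Random(seed)
    lattice = [[rng.random() for _ in range(cells + 1)] for _ in range(cells + 1)]
    grid = []
    for i in range(size):
        y = i * cells / size
        y0 = int(y)
        ty = y - y0
        ty = ty * ty * (3 - 2 * ty)  # smoothstep easing
        row = []
        for j in range(size):
            x = j * cells / size
            x0 = int(x)
            tx = x - x0
            tx = tx * tx * (3 - 2 * tx)
            top = lattice[y0][x0] * (1 - tx) + lattice[y0][x0 + 1] * tx
            bottom = lattice[y0 + 1][x0] * (1 - tx) + lattice[y0 + 1][x0 + 1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        grid.append(row)
    return grid


def height_field(size, seed, octaves=4):
    """Sum several noise octaves into a normalized height map in [0, 1]."""
    total = [[0.0] * size for _ in range(size)]
    amplitude, norm = 1.0, 0.0
    for octave in range(octaves):
        layer = value_noise_grid(size, cells=2 ** (octave + 1), seed=seed * 1000 + octave)
        for i in range(size):
            for j in range(size):
                total[i][j] += amplitude * layer[i][j]
        norm += amplitude
        amplitude *= 0.5  # finer octaves contribute less
    return [[value / norm for value in row] for row in total]


def semantic_field(height, thresholds=(0.4, 0.5, 0.65)):
    """Assign a label index per cell by thresholding height (assumed rule)."""
    def classify(h):
        for index, threshold in enumerate(thresholds):
            if h < threshold:
                return index
        return len(thresholds)
    return [[classify(h) for h in row] for row in height]


def storage_cells(size):
    """Cells stored by two BEV maps versus a dense voxel grid."""
    return {"bev": 2 * size * size, "voxel": size ** 3}


height = height_field(32, seed=0)
semantics = semantic_field(height)
print(LABELS[semantics[0][0]], storage_cells(64))
```

Because the height map and the semantic map are separate 2D arrays, geometry and semantics can be edited independently, for example by changing the thresholds without regenerating the terrain.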