Automatically generating high-quality real-world 3D scenes is of enormous interest for applications such as virtual reality and robotics simulation. Towards this goal, we introduce NeuralField-LDM, a generative model capable of synthesizing complex 3D environments. We leverage Latent Diffusion Models, which have been used successfully for efficient, high-quality 2D content creation. We first train a scene auto-encoder to express a set of image and pose pairs as a neural field, represented as density and feature voxel grids that can be projected to produce novel views of the scene. To further compress this representation, we train a latent auto-encoder that maps the voxel grids to a set of latent representations. A hierarchical diffusion model is then fit to the latents to complete the scene generation pipeline. We achieve a substantial improvement over existing state-of-the-art scene generation models. Additionally, we show how NeuralField-LDM can be used for a variety of 3D content creation applications, including conditional scene generation, scene inpainting, and scene style manipulation.
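To make the idea of projecting density and feature voxel grids into a view more concrete, the following is a minimal sketch of alpha compositing along a single camera ray. It is not the renderer used in NeuralField-LDM: the function name `render_ray`, the unit-cube scene bounds, nearest-neighbour voxel lookup, and the sample count are all assumptions chosen for illustration.

```python
import numpy as np

def render_ray(density, features, origin, direction, near=0.0, far=1.0, n_samples=64):
    """Alpha-composite voxel features along one ray (illustrative sketch only).

    Assumes the voxel grids cover the unit cube [0, 1)^3:
      density:  (X, Y, Z) non-negative densities
      features: (X, Y, Z, C) per-voxel feature vectors
    """
    direction = direction / np.linalg.norm(direction)
    t = np.linspace(near, far, n_samples)            # sample depths along the ray
    delta = (far - near) / (n_samples - 1)           # uniform spacing between samples
    points = origin[None, :] + t[:, None] * direction[None, :]  # (n_samples, 3)

    # Nearest-neighbour lookup; samples outside the cube contribute no density.
    res = np.array(density.shape)
    inside = np.all((points >= 0.0) & (points < 1.0), axis=1)
    idx = np.clip((points * res).astype(int), 0, res - 1)
    sigma = np.where(inside, density[idx[:, 0], idx[:, 1], idx[:, 2]], 0.0)
    feat = features[idx[:, 0], idx[:, 1], idx[:, 2]]  # (n_samples, C)

    # Standard volume-rendering weights: transmittance times per-sample opacity.
    alpha = 1.0 - np.exp(-sigma * delta)
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = transmittance * alpha

    # Return the composited feature vector and the accumulated opacity.
    return weights @ feat, weights.sum()
```

Repeating this for every pixel of a posed camera would yield a 2D feature map for that view; how such a map is decoded into an image is outside the scope of this sketch.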