Generating 3D scenes is a challenging open problem, which requires synthesizing plausible content that is fully consistent in 3D space. While recent methods such as neural radiance fields excel at view synthesis and 3D reconstruction, they cannot synthesize plausible details in unobserved regions since they lack generative capability. Conversely, existing generative methods are typically not capable of reconstructing detailed, large-scale scenes in the wild, as they use limited-capacity 3D scene representations, require aligned camera poses, or rely on additional regularizers. In this work, we introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes. To achieve this, we make three contributions. First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes, dynamically allocating more capacity as needed to capture details visible in each image. Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images without the need for any additional supervision signal such as masks or depths. This supports 3D reconstruction and generation in a unified architecture. Third, we develop a principled approach to avoid trivial 3D solutions when integrating image-based rendering with the diffusion model, by dropping out representations of some images. We evaluate the model on several challenging datasets of real and synthetic images, and demonstrate superior results on generation, novel view synthesis, and 3D reconstruction.
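The third contribution can be illustrated with a minimal sketch. The intuition: if a view's own image-based representation is available when rendering that same view, the model can trivially copy pixels through instead of learning consistent 3D structure; dropping the target view's representation removes this shortcut. The function below is a hypothetical stand-in (simple mean pooling over per-image feature arrays), not the paper's actual IB-planes rendering; all names and the pooling choice are illustrative assumptions.

```python
import numpy as np

def aggregate_for_view(image_feats, target_idx, drop_target=True):
    """Pool per-image scene features when rendering one target view.

    image_feats : list of np.ndarray, one feature array per input image
                  (a crude stand-in for per-image plane representations).
    target_idx  : index of the view currently being rendered/denoised.
    drop_target : if True, exclude the target view's own features, so it
                  must be reconstructed from the *other* views -- this
                  blocks the trivial solution of copying the target image.

    Illustrative sketch only; the real method renders via volumetric
    ray marching over image-based planes, not mean pooling.
    """
    kept = [f for i, f in enumerate(image_feats)
            if not (drop_target and i == target_idx)]
    if not kept:
        raise ValueError("need at least one non-dropped view")
    return np.mean(kept, axis=0)  # naive aggregation across views

# Usage: three views' features; view 0 is rendered without its own features.
feats = [np.full((2, 2), float(i)) for i in range(3)]
pooled = aggregate_for_view(feats, target_idx=0)  # mean of views 1 and 2
```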