Denoising Diffusion via Image-Based Rendering

Generating 3D scenes is a challenging open problem, which requires synthesizing plausible content that is fully consistent in 3D space. While recent methods such as neural radiance fields excel at view synthesis and 3D reconstruction, they cannot synthesize plausible details in unobserved regions since they lack a generative capability. Conversely, existing generative methods are typically not capable of reconstructing detailed, large-scale scenes in the wild, as they use limited-capacity 3D scene representations, require aligned camera poses, or rely on additional regularizers. In this work, we introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes. To achieve this, we make three contributions. First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes, dynamically allocating more capacity as needed to capture details visible in each image. Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images without the need for any additional supervision signal such as masks or depths. This supports 3D reconstruction and generation in a unified architecture. Third, we develop a principled approach to avoid trivial 3D solutions when integrating the image-based rendering with the diffusion model, by dropping out representations of some images. We evaluate the model on several challenging datasets of real and synthetic images, and demonstrate superior results on generation, novel view synthesis and 3D reconstruction.
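
The abstract does not spell out how representations are dropped, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that each view contributes a feature vector for a given 3D point, that these are combined by a plain mean, and that the view being rendered is the one left out; the function names and the pooling rule are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def aggregate_with_view_dropout(
    per_view_features: Sequence[Sequence[float]],
    dropped_views: Sequence[int],
) -> Vector:
    """Average one 3D point's feature vectors over the views that are kept.

    per_view_features[v] is the (hypothetical) feature vector that view v
    contributes for this point; dropped_views lists view indices to ignore.
    """
    dropped = set(dropped_views)
    kept = [f for v, f in enumerate(per_view_features) if v not in dropped]
    if not kept:
        raise ValueError("at least one view must be kept")
    channels = len(kept[0])
    # Mean pooling is an assumption made for this sketch.
    return [sum(f[c] for f in kept) / len(kept) for c in range(channels)]


def features_for_rendering_view(
    per_view_features: Sequence[Sequence[float]],
    target_view: int,
) -> Vector:
    # Assumed dropout rule: the view being rendered does not contribute its
    # own representation, so it cannot be reproduced by copying itself.
    return aggregate_with_view_dropout(per_view_features, [target_view])


# Three views, two feature channels, for a single 3D point.
features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
full = aggregate_with_view_dropout(features, [])
held_out = features_for_rendering_view(features, 0)
```

Under this assumed rule, the features used to render a view come only from the other views, which is one way a model could be prevented from explaining each image with its own representation instead of a shared 3D scene.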
