3D-aware image synthesis encompasses a variety of tasks, such as scene generation and novel view synthesis from images. Despite numerous task-specific methods, developing a comprehensive model remains challenging. In this paper, we present SSDNeRF, a unified approach that employs an expressive diffusion model to learn a generalizable prior of neural radiance fields (NeRF) from multi-view images of diverse objects. Previous studies have used two-stage approaches that rely on pretrained NeRFs as real data to train diffusion models. In contrast, we propose a new single-stage training paradigm with an end-to-end objective that jointly optimizes a NeRF auto-decoder and a latent diffusion model, enabling simultaneous 3D reconstruction and prior learning, even from sparsely available views. At test time, we can directly sample the diffusion prior for unconditional generation, or combine it with arbitrary observations of unseen objects for NeRF reconstruction. SSDNeRF demonstrates robust results comparable to or better than leading task-specific methods in unconditional generation and single/sparse-view 3D reconstruction.
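The single-stage idea described above, in which one objective updates the per-scene codes, the decoder, and the diffusion prior together, can be illustrated with a deliberately tiny toy. This is a sketch under stated assumptions and not the SSDNeRF implementation: the linear map `W` stands in for the NeRF auto-decoder and renderer, the linear map `A` stands in for the latent diffusion denoiser, and `mask` marks which "pixels" of each scene are observed (mimicking sparse views). The names `joint_loss_and_grads`, `train`, `lam_rend`, `lam_diff`, and the noise schedule are all hypothetical choices made for this example.

```python
import numpy as np


def joint_loss_and_grads(X, W, A, Y, mask, eps, alpha, sigma,
                         lam_rend=1.0, lam_diff=0.1):
    """Toy joint objective: weighted rendering loss plus denoising loss.

    X: (n, d) per-scene latent codes (auto-decoder style, optimized directly).
    W: (m, d) linear stand-in for the decoder/renderer (assumption).
    A: (d, d) linear stand-in for the noise-prediction network (assumption).
    Y, mask: (n, m) observations and a 0/1 mask of which entries are observed.
    eps: (n, d) Gaussian noise used to perturb the codes.
    """
    n = X.shape[0]

    # Rendering term: compare decoded codes with the observed entries only.
    r = mask * (X @ W.T - Y)
    loss_rend = np.sum(r ** 2) / n

    # Diffusion term: noise the codes, then ask A to predict the noise.
    X_t = alpha * X + sigma * eps
    q = X_t @ A.T - eps
    loss_diff = np.sum(q ** 2) / n

    loss = lam_rend * loss_rend + lam_diff * loss_diff

    # Analytic gradients. The codes X receive gradients from both terms,
    # which is what makes this toy "single-stage".
    grads = {
        "X": (2.0 / n) * (lam_rend * (r @ W) + lam_diff * alpha * (q @ A)),
        "W": (2.0 / n) * lam_rend * (r.T @ X),
        "A": (2.0 / n) * lam_diff * (q.T @ X_t),
    }
    return loss, grads


def train(Y, mask, latent_dim=4, steps=100, lr=0.01, seed=0):
    """Plain gradient descent on codes, decoder, and denoiser at once."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    X = 0.1 * rng.standard_normal((n, latent_dim))
    W = 0.1 * rng.standard_normal((m, latent_dim))
    A = np.zeros((latent_dim, latent_dim))
    for _ in range(steps):
        # Hypothetical noise schedule: sample a noise level t per step.
        t = rng.uniform(0.05, 0.95)
        alpha, sigma = np.sqrt(1.0 - t), np.sqrt(t)
        eps = rng.standard_normal(X.shape)
        _, g = joint_loss_and_grads(X, W, A, Y, mask, eps, alpha, sigma)
        X -= lr * g["X"]
        W -= lr * g["W"]
        A -= lr * g["A"]
    return X, W, A
```

A two-stage pipeline would instead fit `X` and `W` first with the rendering term alone, freeze the codes, and only then fit `A`. In this toy the difference is simply whether the diffusion term contributes to the gradient of `X`.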