3D-aware image synthesis encompasses a variety of tasks, such as scene generation and novel view synthesis from images. Despite numerous task-specific methods, developing a comprehensive model remains challenging. In this paper, we present SSDNeRF, a unified approach that employs an expressive diffusion model to learn a generalizable prior of neural radiance fields (NeRF) from multi-view images of diverse objects. Previous studies have used two-stage approaches that rely on pretrained NeRFs as real data to train diffusion models. In contrast, we propose a new single-stage training paradigm with an end-to-end objective that jointly optimizes a NeRF auto-decoder and a latent diffusion model, enabling simultaneous 3D reconstruction and prior learning, even from sparsely available views. At test time, we can directly sample the diffusion prior for unconditional generation, or combine it with arbitrary observations of unseen objects for NeRF reconstruction. SSDNeRF demonstrates robust results comparable to or better than leading task-specific methods in unconditional generation and single/sparse-view 3D reconstruction.
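To make the single-stage idea concrete, the following toy sketch jointly updates per-scene latent codes, a decoder, and a denoiser with one weighted objective, so the codes receive gradients from both the rendering term and the diffusion term in the same step. It is not the SSDNeRF implementation. The scalar latent codes, the linear decoder standing in for the NeRF auto-decoder, the one-weight linear denoiser standing in for the latent diffusion model, the noise-level range, and the weights `lam_rend` and `lam_diff` are all assumptions chosen so that the example runs with the Python standard library alone.

```python
import math
import random


def train_joint(num_scenes=8, num_pixels=6, steps=1000, lr=0.1,
                lam_rend=1.0, lam_diff=0.1, seed=0):
    """Toy single-stage training: every parameter is updated from one joint loss."""
    rng = random.Random(seed)

    # Synthetic observations (hypothetical data): scene i renders as x_true[i] * w_true.
    w_true = [rng.gauss(0, 1) for _ in range(num_pixels)]
    x_true = [rng.gauss(0, 1) for _ in range(num_scenes)]
    obs = [[x * w for w in w_true] for x in x_true]

    codes = [0.1 * rng.gauss(0, 1) for _ in range(num_scenes)]  # per-scene latents
    dec = [0.1 * rng.gauss(0, 1) for _ in range(num_pixels)]    # toy linear "decoder"
    den = 0.0                                                   # toy linear "denoiser"
    history = []

    for _ in range(steps):
        g_codes = [0.0] * num_scenes
        g_dec = [0.0] * num_pixels
        g_den = 0.0
        l_rend = 0.0
        l_diff = 0.0
        for i in range(num_scenes):
            # Rendering term: squared error between decoded code and observations.
            for j in range(num_pixels):
                r = codes[i] * dec[j] - obs[i][j]
                l_rend += 0.5 * r * r / num_scenes
                g_codes[i] += lam_rend * r * dec[j] / num_scenes
                g_dec[j] += lam_rend * r * codes[i] / num_scenes
            # Diffusion term: noise the code, then predict the noise linearly.
            alpha_bar = rng.uniform(0.1, 0.9)  # assumed noise-level range
            a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
            eps = rng.gauss(0, 1)
            x_t = a * codes[i] + s * eps
            e = den * x_t - eps
            l_diff += 0.5 * e * e / num_scenes
            g_den += lam_diff * e * x_t / num_scenes
            # The prior term also sends a gradient back into the latent code.
            g_codes[i] += lam_diff * e * den * a / num_scenes
        # One gradient step on codes, decoder, and denoiser together.
        codes = [c - lr * g for c, g in zip(codes, g_codes)]
        dec = [w - lr * g for w, g in zip(dec, g_dec)]
        den -= lr * g_den
        history.append((l_rend, l_diff))
    return codes, dec, den, history


if __name__ == "__main__":
    _, _, _, hist = train_joint()
    print("rendering loss (first, last):", hist[0][0], hist[-1][0])
```

A two-stage variant of this toy would first fit `codes` and `dec` using the rendering term only, then freeze `codes` and fit `den`. In the joint version the latent codes are shaped by the prior while they are being reconstructed.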