We present a diffusion-based model for 3D-aware generative novel view synthesis from as few as a single input image. Our model samples from the distribution of possible renderings consistent with the input and, even in the presence of ambiguity, is capable of rendering diverse and plausible novel views. To achieve this, our method makes use of existing 2D diffusion backbones but, crucially, incorporates geometry priors in the form of a 3D feature volume. This latent feature field captures the distribution over possible scene representations and improves our method's ability to generate view-consistent novel renderings. In addition to generating novel views, our method has the ability to autoregressively synthesize 3D-consistent sequences. We demonstrate state-of-the-art results on synthetic renderings and room-scale scenes; we also show compelling results for challenging, real-world objects.
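The autoregressive synthesis mentioned above can be illustrated with a minimal sketch. It only shows the control flow of conditioning each new view on the input image plus previously generated frames. The function names, the `max_context` parameter, and the context-selection rule are assumptions made for illustration, not details taken from the method. `sample_novel_view` is a hypothetical placeholder that averages its conditioning images and adds noise so the loop runs; it does not perform diffusion sampling and does not build a 3D feature volume.

```python
import numpy as np


def sample_novel_view(cond_images, cond_poses, target_pose, rng):
    """Hypothetical stand-in for a pose-conditioned diffusion sampler.

    A real sampler would lift the conditioning views into a 3D feature
    volume and run reverse diffusion; this stub ignores the poses and
    returns a noisy average of the conditioning images.
    """
    mean = np.mean(np.stack(cond_images), axis=0)
    noise = 0.01 * rng.standard_normal(mean.shape)
    return np.clip(mean + noise, 0.0, 1.0)


def synthesize_sequence(input_image, input_pose, target_poses,
                        max_context=4, seed=0):
    """Generate one frame per target pose, autoregressively.

    Each step conditions on the original input view plus the most
    recently generated frames, up to `max_context` views in total
    (an assumed selection rule).
    """
    rng = np.random.default_rng(seed)
    images, poses = [input_image], [input_pose]
    outputs = []
    for pose in target_poses:
        # Always keep index 0 (the real input), then the latest frames.
        start = max(1, len(images) - (max_context - 1))
        ctx = [0] + list(range(start, len(images)))
        view = sample_novel_view([images[i] for i in ctx],
                                 [poses[i] for i in ctx],
                                 pose, rng)
        images.append(view)
        poses.append(pose)
        outputs.append(view)
    return outputs
```

Keeping the real input view in every conditioning set is one simple way to limit drift over long trajectories; other selection rules are equally plausible.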