Novel view synthesis from a single image requires inferring the occluded regions of objects and scenes while maintaining semantic and physical consistency with the input. Existing approaches condition neural radiance fields (NeRF) on local image features: they project 3D points onto the input image plane and aggregate the 2D features found there to perform volume rendering. Under severe occlusion, however, this projection cannot resolve the uncertainty, and the resulting renderings are blurry and lack detail. In this work, we propose NerfDiff, which addresses this issue by distilling the knowledge of a 3D-aware conditional diffusion model (CDM) into NeRF by synthesizing and refining a set of virtual views at test time. We further propose a novel NeRF-guided distillation algorithm that simultaneously generates 3D-consistent virtual views from the CDM samples and finetunes the NeRF on the improved virtual views. Our approach significantly outperforms existing NeRF-based and geometry-free approaches on challenging datasets, including ShapeNet, ABO, and Clevr3D.
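To make the image-conditioned baseline concrete, the following is a minimal NumPy sketch of the three steps named above: projecting 3D points onto the input image plane, sampling 2D features at the projected locations, and compositing per-sample outputs along a ray with standard volume rendering. It illustrates the general technique rather than the NerfDiff implementation. The function names (`project_points`, `bilinear_sample`, `composite`), the pinhole intrinsics `K`, and the feature-map layout are assumptions introduced here for illustration.

```python
import numpy as np

def project_points(points_cam, K):
    """Project (N, 3) camera-space points to (N, 2) pixel coords (pinhole model, assumed)."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def bilinear_sample(feat, uv):
    """Bilinearly sample an (H, W, C) feature map at continuous (u=x, v=y) pixel coords."""
    H, W, _ = feat.shape
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u1] * du
    bottom = feat[v1, u0] * (1 - du) + feat[v1, u1] * du
    return top * (1 - dv) + bottom * dv

def composite(sigma, rgb, deltas):
    """Alpha-composite (S,) densities and (S, 3) colors along one ray."""
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha[:-1])])
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)

# Hypothetical setup: a 64x64 feature map whose two channels store (x, y).
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
ys, xs = np.meshgrid(np.arange(64.0), np.arange(64.0), indexing="ij")
feat = np.stack([xs, ys], axis=-1)

pts = np.array([[0.0, 0.0, 2.0], [0.1, 0.0, 1.0]])  # samples in camera space
uv = project_points(pts, K)
local_feats = bilinear_sample(feat, uv)  # would condition the NeRF MLP
```

In an image-conditioned NeRF, `local_feats` would be passed to the radiance-field network together with the point position. A point hidden behind a visible surface still projects to the pixel of that surface, so the sampled feature says little about the hidden region, which is the ambiguity the abstract attributes to severe occlusion.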