We propose SparseFusion, a sparse-view 3D reconstruction approach that unifies recent advances in neural rendering and probabilistic image generation. Existing approaches typically build on neural rendering with re-projected features but fail to generate unseen regions or handle uncertainty under large viewpoint changes. Alternative methods treat this as a (probabilistic) 2D synthesis task, and while they can generate plausible 2D images, they do not infer a consistent underlying 3D representation. However, we find that this trade-off between 3D consistency and probabilistic image generation does not need to exist. In fact, we show that geometric consistency and generative inference can be complementary under mode-seeking behavior. By distilling a 3D-consistent scene representation from a view-conditioned latent diffusion model, we are able to recover a plausible 3D representation whose renderings are both accurate and realistic. We evaluate our approach across 51 categories in the CO3D dataset and show that it outperforms existing methods on both distortion and perceptual metrics for sparse-view novel view synthesis.