We introduce ZeroNVS, a 3D-aware diffusion model for single-image novel view synthesis of in-the-wild scenes. While existing methods are designed for single objects with masked backgrounds, we propose new techniques to address the challenges introduced by in-the-wild multi-object scenes with complex backgrounds. Specifically, we train a generative prior on a mixture of data sources that capture object-centric, indoor, and outdoor scenes. To address issues that arise from mixing data sources, such as depth-scale ambiguity, we propose a novel camera conditioning parameterization and normalization scheme. Further, we observe that Score Distillation Sampling (SDS) tends to truncate the distribution of complex backgrounds during distillation of 360-degree scenes, and we propose "SDS anchoring" to improve the diversity of synthesized novel views. Our model sets a new state-of-the-art LPIPS result on the DTU dataset in the zero-shot setting, even outperforming methods specifically trained on DTU. We further adapt the challenging Mip-NeRF 360 dataset as a new benchmark for single-image novel view synthesis and demonstrate strong performance in this setting. Our code and data are at http://kylesargent.github.io/zeronvs/
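The depth-scale ambiguity mentioned above can be illustrated with a small sketch. It is not the ZeroNVS conditioning or normalization scheme, whose details are not given here; it only shows the underlying geometric fact that scaling a scene and its camera position by the same factor leaves the pinhole image unchanged, so datasets with different scale conventions cannot be told apart from images alone. The function names, the identity camera rotation, and the sample points are assumptions chosen for illustration.

```python
import math

def project(point, cam_center, focal=1.0):
    """Pinhole projection for a camera at cam_center looking down +z.

    Assumes an identity rotation to keep the sketch short.
    """
    x, y, z = (p - c for p, c in zip(point, cam_center))
    return (focal * x / z, focal * y / z)

def scale(vec, s):
    """Scale a 3D vector uniformly by s."""
    return tuple(s * v for v in vec)

# Hypothetical scene points and camera position (arbitrary units).
points = [(0.5, -0.2, 4.0), (-1.0, 0.3, 6.0)]
cam_center = (0.2, 0.0, -1.0)

# Scale the whole scene, including the camera position, by the same factor.
s = 3.0
original = [project(p, cam_center) for p in points]
rescaled = [project(scale(p, s), scale(cam_center, s)) for p in points]

# The two sets of image coordinates agree up to floating-point error.
same_image = all(
    math.isclose(u0, u1) and math.isclose(v0, v1)
    for (u0, v0), (u1, v1) in zip(original, rescaled)
)
print(same_image)
```

Because the images are identical, the magnitude of a camera translation carries no consistent meaning across datasets unless some normalization is applied, which is the kind of problem a camera normalization scheme is meant to resolve.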