Generative models have shown great promise for novel view synthesis (NVS) by leveraging strong image generation priors. However, existing approaches typically follow a 2D inpainting paradigm, first completing missing image regions and then performing 3D reconstruction. This strategy often causes geometry distortion and appearance drift, as 2D inpainting models cannot reliably infer the underlying 3D structure required for cross-view consistent generation. In this paper, we propose \textbf{SceneCompleter}, a geometry-aware framework that reformulates generative NVS as dense 3D scene completion. Instead of hallucinating isolated 2D views, SceneCompleter jointly completes geometry and appearance through a geometry-appearance dual-stream diffusion model in a spatially aligned RGBD latent space. To provide holistic scene context, we further introduce a Scene Embedder that conditions generation on global semantic and stylistic information from reference images. The completed RGBD predictions are then aligned and integrated into an expandable 3D scene representation, enabling iterative and coherent scene completion. Extensive experiments on in-domain and out-of-distribution datasets demonstrate that SceneCompleter produces visually plausible and geometrically consistent novel views across diverse scenarios. Project Page: https://chen-wl20.github.io/SceneCompleter
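The abstract says the completed RGBD predictions are "aligned and integrated into an expandable 3D scene representation" but does not say how. As an illustration only, the sketch below assumes one common approach. It fits a least-squares scale and shift that maps the predicted depth onto depth already known from the scene. It then back-projects only the newly completed pixels through a pinhole camera and appends them to a point set. The function names, the use of `None` to mark missing pixels, the `(fx, fy, cx, cy)` intrinsics tuple, and the identity camera pose are all hypothetical simplifications. They are not SceneCompleter's actual procedure.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]
Intrinsics = Tuple[float, float, float, float]  # hypothetical (fx, fy, cx, cy), pinhole model


def align_scale_shift(pred: List[List[float]],
                      known: List[List[Optional[float]]]) -> Tuple[float, float]:
    """Closed-form least squares for (s, t) minimizing sum (s * pred + t - known)^2
    over pixels whose depth is already known (known value is not None)."""
    xs: List[float] = []
    ys: List[float] = []
    for pred_row, known_row in zip(pred, known):
        for p, k in zip(pred_row, known_row):
            if k is not None:
                xs.append(p)
                ys.append(k)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two pixels with known depth")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("predicted depth is constant on known pixels")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    scale = cov_xy / var_x
    shift = mean_y - scale * mean_x
    return scale, shift


def backproject(u: int, v: int, z: float, intr: Intrinsics) -> Point:
    """Lift pixel (u, v) with depth z to a 3D point in camera coordinates."""
    fx, fy, cx, cy = intr
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def integrate_completion(scene: List[Point],
                         pred_depth: List[List[float]],
                         known_depth: List[List[Optional[float]]],
                         intr: Intrinsics) -> List[Point]:
    """Align the completed depth to the known depth, then add 3D points only for
    previously missing pixels. Assumes an identity camera pose for simplicity,
    so camera coordinates are treated as scene coordinates."""
    scale, shift = align_scale_shift(pred_depth, known_depth)
    new_points = [
        backproject(u, v, scale * p + shift, intr)
        for v, (pred_row, known_row) in enumerate(zip(pred_depth, known_depth))
        for u, (p, k) in enumerate(zip(pred_row, known_row))
        if k is None
    ]
    return scene + new_points
```

For example, with predicted depth `[[1.0, 2.0], [3.0, 4.0]]` and known depth `[[3.0, 5.0], [7.0, None]]`, the fit gives scale 2.0 and shift 1.0, so the single missing pixel receives depth 9.0 before being lifted to 3D. Fitting only on known pixels keeps the existing scene fixed while the new prediction is brought into its metric frame. A full system would also apply the camera-to-world pose and would likely fuse or deduplicate overlapping points.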