This paper addresses a challenging question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality in backgrounds, and distorted reconstructions in unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splatting representations of scenes in a feed-forward manner. The video diffusion model is designed to generate videos that precisely follow specified camera trajectories, allowing it to produce compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets demonstrate that our model significantly outperforms existing methods for single-view 3D scene generation, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.
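To make the described pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the data flow: a single image plus a camera trajectory is passed through a camera-guided video diffusion model that yields compressed video latents, and a feed-forward reconstruction head maps those latents directly to 3D Gaussian parameters. All module names (`CameraGuidedVideoDiffusion`, `LatentGaussianReconstructor`), tensor shapes, and the per-Gaussian parameterization are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the latent-space 3D scene generation pipeline.
# Module names, shapes, and the Gaussian parameterization are assumptions.
import torch
import torch.nn as nn

class CameraGuidedVideoDiffusion(nn.Module):
    """Stand-in for a video diffusion model conditioned on an input image and
    a camera trajectory; it returns compressed video latents, not RGB frames."""
    def __init__(self, latent_dim=16, num_frames=8, latent_hw=32):
        super().__init__()
        self.latent_dim = latent_dim
        self.num_frames = num_frames
        self.latent_hw = latent_hw

    def forward(self, image, cameras):
        b = image.shape[0]
        # Placeholder for iterative denoising guided by `cameras`; a real model
        # would produce 3D-consistent latents following the trajectory.
        return torch.randn(b, self.num_frames, self.latent_dim,
                           self.latent_hw, self.latent_hw)

class LatentGaussianReconstructor(nn.Module):
    """Feed-forward head mapping video latents to per-pixel 3D Gaussian
    parameters (3 position + 3 scale + 4 rotation + 1 opacity + 3 color = 14)."""
    def __init__(self, latent_dim=16, gaussian_dim=14):
        super().__init__()
        self.head = nn.Conv2d(latent_dim, gaussian_dim, kernel_size=1)

    def forward(self, latents):
        b, t, c, h, w = latents.shape
        params = self.head(latents.reshape(b * t, c, h, w))     # (b*t, 14, h, w)
        # One Gaussian per latent pixel per frame, flattened into a single set.
        return params.permute(0, 2, 3, 1).reshape(b, t * h * w, -1)

# Usage: single image + camera trajectory -> 3D Gaussians, no per-scene optimization.
image = torch.randn(1, 3, 256, 256)        # single input view
cameras = torch.randn(1, 8, 4, 4)          # 8 poses along a chosen trajectory
diffusion = CameraGuidedVideoDiffusion()
reconstructor = LatentGaussianReconstructor()
with torch.no_grad():
    latents = diffusion(image, cameras)    # compressed, multi-view video latents
    gaussians = reconstructor(latents)     # (1, N, 14) Gaussian parameters
print(gaussians.shape)
```

Because the reconstructor consumes compressed latents rather than decoded frames, the feed-forward prediction avoids both per-scene optimization and the cost of operating on full-resolution multi-view imagery, which is the efficiency argument the abstract makes.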