We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis (NVS) of diverse real-world scenes, using only sparse observations. This setting is inherently ill-posed due to minimal overlap among input views and the limited visual information they provide, making it challenging for conventional methods to achieve high-quality results. Our MVSplat360 addresses this by effectively combining geometry-aware 3D reconstruction with temporally consistent video generation. Specifically, it refactors a feed-forward 3D Gaussian Splatting (3DGS) model to render features directly into the latent space of a pre-trained Stable Video Diffusion (SVD) model, where these features then act as pose and visual cues to guide the denoising process and produce photorealistic, 3D-consistent views. Our model is end-to-end trainable and supports rendering arbitrary views from as few as 5 sparse input views. To evaluate MVSplat360's performance, we introduce a new benchmark built on the challenging DL3DV-10K dataset, where MVSplat360 achieves superior visual quality compared to state-of-the-art methods on wide-sweeping and even 360° NVS tasks. Experiments on the existing RealEstate10K benchmark further confirm the effectiveness of our model. Video results are available on our project page: https://donydchen.github.io/mvsplat360.
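The pipeline described above can be summarized as a three-stage flow: a feed-forward 3DGS reconstructor maps sparse views to Gaussian primitives, those primitives are rendered as features into the SVD latent space, and the diffusion denoiser uses them as pose and visual cues. The following is a minimal structural sketch of that flow; all function names, data shapes, and the scalar "features" are illustrative assumptions, not the authors' actual implementation or API.

```python
# Hypothetical sketch of the MVSplat360 pipeline; every function here is a
# toy stand-in chosen only to show the data flow between the three stages.

def feed_forward_3dgs(views):
    """Stand-in for the feed-forward 3DGS reconstructor: maps sparse
    input views to Gaussian primitives (toy dicts with scalar features)."""
    return [{"view_id": i, "feature": sum(v) / len(v)}
            for i, v in enumerate(views)]

def render_latent_features(gaussians, target_pose):
    """Stand-in for rendering Gaussians directly into the latent space
    of the video diffusion model, conditioned on the target pose."""
    return [g["feature"] * target_pose for g in gaussians]

def svd_denoise(latent_cues):
    """Stand-in for the SVD denoiser: the rendered latent features act
    as pose and visual cues guiding denoising toward the target view."""
    return sum(latent_cues) / len(latent_cues)

def mvsplat360(views, target_pose):
    gaussians = feed_forward_3dgs(views)                    # geometry-aware reconstruction
    cues = render_latent_features(gaussians, target_pose)   # features in SVD latent space
    return svd_denoise(cues)                                # temporally consistent generation

# As the abstract notes, as few as 5 sparse input views suffice.
views = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
result = mvsplat360(views, target_pose=2.0)
print(result)
```

Because the model is end-to-end trainable, in practice all three stages would be differentiable modules optimized jointly, rather than the fixed functions above.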