Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions. Existing methods rely on scene-specific optimization using appearance embeddings or dynamic masks, which requires extensive per-scene training and fails under sparse views. Moreover, their evaluation on a limited set of scenes raises questions about generalization. We present GenWildSplat, a feed-forward framework for sparse-view outdoor reconstruction that requires no per-scene optimization. Given unposed internet images, GenWildSplat predicts depth, camera parameters, and 3D Gaussians in a canonical space using learned geometric priors. An appearance adapter modulates appearance for target lighting conditions, while semantic segmentation handles transient objects. Through curriculum learning on synthetic and real data, GenWildSplat generalizes across diverse illumination and occlusion patterns. Evaluations on the PhotoTourism and MegaScenes benchmarks demonstrate state-of-the-art feed-forward rendering quality, achieving real-time inference without test-time optimization.