We introduce a method for novel view synthesis given only a single wide-baseline stereo image pair. In this challenging regime, 3D scene points are regularly observed only once, requiring prior-based reconstruction of scene geometry and appearance. We find that existing approaches to novel view synthesis from sparse observations fail for two reasons: they recover incorrect 3D geometry, and the high cost of differentiable rendering precludes scaling them to large-scale training. We take a step towards resolving these shortcomings by proposing a multi-view transformer encoder, an efficient image-space epipolar line sampling scheme that assembles image features for a target ray, and a lightweight cross-attention-based renderer. Our contributions enable training of our method on a large-scale real-world dataset of indoor and outdoor scenes. We demonstrate that our method learns powerful multi-view geometry priors while reducing rendering time. We conduct extensive comparisons on held-out test scenes across two real-world datasets, significantly outperforming prior work on novel view synthesis from sparse image observations and achieving multi-view-consistent novel view synthesis.
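To make the image-space epipolar line sampling idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a pinhole context camera with intrinsics `K` and a world-to-camera pose `(R, t)`, and a target ray clipped to a hypothetical `[near, far]` depth range. The function names `project` and `sample_epipolar_line` are illustrative. The two ray endpoints are projected into the context image, and sample positions are then spaced uniformly in pixel space along the resulting segment rather than uniformly in 3D depth.

```python
def project(K, R, t, X):
    """Project world point X with a pinhole camera (assumed model).

    K: 3x3 intrinsics, R: 3x3 world-to-camera rotation, t: translation,
    all given as plain nested lists. Returns (u, v, depth).
    """
    # Camera-space point: Xc = R @ X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Homogeneous pixel coordinates: p = K @ Xc
    p = [sum(K[i][j] * Xc[j] for j in range(3)) for i in range(3)]
    return p[0] / p[2], p[1] / p[2], Xc[2]


def sample_epipolar_line(K, R, t, origin, direction, near, far, n_samples):
    """Return n_samples (u, v) positions spaced uniformly in image space
    along the projection of a target ray into the context view."""
    # 3D endpoints of the target ray segment (near/far are assumptions).
    p_near = [origin[i] + near * direction[i] for i in range(3)]
    p_far = [origin[i] + far * direction[i] for i in range(3)]

    u0, v0, z0 = project(K, R, t, p_near)
    u1, v1, z1 = project(K, R, t, p_far)
    # This sketch does not clip segments that cross the camera plane.
    if z0 <= 0 or z1 <= 0:
        raise ValueError("ray segment must lie in front of the context camera")

    samples = []
    for k in range(n_samples):
        # Interpolate in pixel space, not in depth.
        a = k / (n_samples - 1) if n_samples > 1 else 0.0
        samples.append((u0 + a * (u1 - u0), v0 + a * (v1 - v0)))
    return samples
```

In a full pipeline, image features from the encoder would be read at these pixel positions (for example with bilinear interpolation) and passed to the renderer as the feature set for that target ray. That step is omitted here.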