Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue that such paired supervision is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project the result into the new camera to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.
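The training-time pseudo-target construction described above (lift with depth, apply a camera transformation, project, then restrict the loss to valid pixels) can be illustrated with a minimal NumPy sketch. This is not the released OVIE code: the function names `warp_to_pseudo_target` and `masked_l1`, the pinhole intrinsics `K`, the nearest-pixel splatting with a z-buffer, and the plain L1 loss standing in for the geometric, perceptual, and textural terms are all assumptions made for illustration.

```python
import numpy as np


def warp_to_pseudo_target(image, depth, K, R, t):
    """Forward-warp `image` (H, W, C) into a new camera using `depth` (H, W).

    Hypothetical helper: K is a 3x3 pinhole intrinsics matrix, and (R, t) maps
    source-camera coordinates to target-camera coordinates (X' = R X + t).
    Returns the pseudo-target image and a boolean validity mask (H, W).
    """
    H, W = depth.shape
    colors = image.reshape(H * W, -1)

    # Homogeneous pixel coordinates [u, v, 1] for every source pixel.
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Lift to 3D: X = depth * K^-1 [u, v, 1]^T.
    pts = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)

    # Apply the sampled camera transformation.
    pts_new = pts @ R.T + t
    z = pts_new[:, 2]
    in_front = z > 1e-6
    z_safe = np.where(in_front, z, 1.0)  # avoid dividing by non-positive depth

    # Project into the target camera and snap to the nearest pixel.
    proj = pts_new @ K.T
    u2 = np.round(proj[:, 0] / z_safe).astype(np.int64)
    v2 = np.round(proj[:, 1] / z_safe).astype(np.int64)
    inside = in_front & (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)

    # Z-buffer: for each target pixel keep only the nearest source point.
    idx = np.flatnonzero(inside)
    lin = v2[idx] * W + u2[idx]
    order = np.lexsort((z[idx], lin))  # primary key: target pixel, secondary: depth
    lin_sorted = lin[order]
    first = np.ones(lin_sorted.shape, dtype=bool)
    first[1:] = lin_sorted[1:] != lin_sorted[:-1]
    winners = idx[order][first]
    dest = lin_sorted[first]

    target = np.zeros_like(colors)
    valid = np.zeros(H * W, dtype=bool)
    target[dest] = colors[winners]
    valid[dest] = True  # pixels never written are disocclusions / out of view
    return target.reshape(image.shape), valid.reshape(H, W)


def masked_l1(pred, target, mask):
    """Mean absolute error computed over valid (mask == True) pixels only."""
    m = mask[..., None].astype(pred.dtype)
    denom = max(float(m.sum()) * pred.shape[-1], 1.0)
    return float((np.abs(pred - target) * m).sum() / denom)
```

In this sketch, pixels that receive no source point are left at zero and flagged invalid, so a prediction is never penalized in regions the pseudo-target cannot explain. Under these assumptions the depth input is used only to build the training target, which is consistent with the abstract's statement that no depth estimator is needed at inference.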