Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing methods encode cameras as Plücker ray maps, which tie predictions to an arbitrary world-coordinate gauge and make them sensitive to small camera transformations, undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space into a well-conditioned image-to-image translation problem in the target view. Additionally, we introduce a masked-autoencoding pretraining strategy tailored to this cue, enabling pretraining on large-scale uncalibrated data. On our view-consistency benchmark, our method achieves higher fidelity and stronger cross-view consistency than ray-conditioned baselines, and it reaches state-of-the-art quality on standard novel view synthesis benchmarks.
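For concreteness, the Plücker ray-map conditioning that the abstract critiques can be written down directly: each pixel of the target view is encoded by its ray direction and moment in world coordinates. Below is a minimal NumPy sketch under common conventions; the function name, the world-to-camera `(R, t)` convention, and the half-pixel-center offset are our assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def plucker_ray_map(K, R, t, H, W):
    """Per-pixel Plücker coordinates (d, m) for a pinhole camera.

    K: (3, 3) intrinsics. R, t: world-to-camera pose, i.e. a world
    point X projects via K @ (R @ X + t). Camera center o = -R^T t.
    Returns an (H, W, 6) ray map: [direction | moment] per pixel.
    """
    # Homogeneous pixel coordinates (u, v, 1), sampled at pixel centers.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)          # (H, W, 3)

    # Back-project to world-space ray directions: d = R^T K^{-1} p,
    # computed row-wise as p^T K^{-T} R, then normalized to unit length.
    dirs = pix @ np.linalg.inv(K).T @ R                       # (H, W, 3)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Moment m = o x d, with o the camera center in world coordinates.
    o = -R.T @ t
    moments = np.cross(o, dirs)                               # (H, W, 3)

    # Six-channel map that ray-conditioned models take as input.
    return np.concatenate([dirs, moments], axis=-1)           # (H, W, 6)
```

The sketch also makes the abstract's gauge argument visible: the moment term o × d depends on where the world origin sits, so a global rigid re-gauging of the scene changes the six-channel input even though the desired rendering does not. This is the coordinate-gauge sensitivity that projective conditioning is designed to avoid.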