We propose UpFusion, a system that performs novel view synthesis and infers 3D representations of an object from a sparse set of reference images without corresponding pose information. Current sparse-view 3D inference methods typically rely on camera poses to geometrically aggregate information from the input views, and are therefore not robust in the wild when such information is unavailable or inaccurate. In contrast, UpFusion sidesteps this requirement by learning to implicitly leverage the available images as context in a conditional generative model for synthesizing novel views. We incorporate two complementary forms of conditioning into diffusion models to leverage the input views: (a) query-view-aligned features inferred by a scene-level transformer, and (b) intermediate attentional layers that directly observe the input image tokens. We show that this mechanism generates high-fidelity novel views and that synthesis quality improves as additional (unposed) images are provided. We evaluate our approach on the Co3Dv2 and Google Scanned Objects datasets and demonstrate its benefits over pose-reliant sparse-view methods as well as single-view methods that cannot leverage additional views. Finally, we show that our learned model generalizes beyond the training categories and even enables reconstruction from self-captured images of generic objects in the wild.
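To make the second form of conditioning more concrete, the following is a minimal NumPy sketch of a single cross-attention step in which query-view features attend to tokens pooled from several unposed input images. It is an illustration under stated assumptions, not the UpFusion implementation: the function name `cross_attend`, the projection matrices `w_q`, `w_k`, `w_v`, the single attention head, the residual connection, and all tensor sizes are hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(query_feats, image_tokens, w_q, w_k, w_v):
    """Single-head cross-attention over tokens from unposed input views.

    query_feats:  (Q, d) features for the view being synthesized (hypothetical).
    image_tokens: (N, T, d) tokens from N input images, T tokens each.
    w_q, w_k, w_v: (d, d) projection matrices (hypothetical, random in the demo).
    """
    # Pool tokens from all views into one context set; no camera pose is used.
    context = image_tokens.reshape(-1, image_tokens.shape[-1])  # (N*T, d)
    q = query_feats @ w_q
    k = context @ w_k
    v = context @ w_v
    # Scaled dot-product attention weights over every input token.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (Q, N*T)
    # Residual update of the query features (an assumption of this sketch).
    return query_feats + weights @ v


# Demo with random data: 4 query features, 3 input views of 5 tokens each.
rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
query = rng.normal(size=(4, d))
views = rng.normal(size=(3, 5, d))

out = cross_attend(query, views, w_q, w_k, w_v)
out_two_views = cross_attend(query, views[:2], w_q, w_k, w_v)
```

Because the tokens are pooled into one set, this sketch accepts any number of input views, and without positional or view encodings its output does not depend on the order in which the views are supplied. Whether the actual model shares these properties is not stated in the abstract.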