DragView: Generalizable Novel View Synthesis with Unposed Imagery

We introduce DragView, a novel and interactive framework for generating novel views of unseen scenes. DragView initializes the new view from a single source image and renders it with support from a sparse set of unposed multi-view images, all within a single feed-forward pass. The user first drags a source view through a local relative coordinate system. Pixel-aligned features are then obtained by projecting 3D points sampled along each target ray onto the source view. A view-dependent modulation layer handles occlusion during this projection. We further broaden the epipolar attention mechanism to encompass all source pixels, which lets the model aggregate the initialized coordinate-aligned point features from the other unposed views. Finally, another transformer decodes the ray features into final pixel intensities. Crucially, our framework relies on neither 2D prior models nor explicit camera pose estimation. At test time, DragView generalizes to scenes unseen during training using only unposed support images, and generates photo-realistic new views along flexible camera trajectories. In our experiments, we compare DragView with recent scene representation networks operating under pose-free conditions, and with generalizable NeRFs subject to noisy test camera poses. DragView consistently achieves higher view synthesis quality while also being more user-friendly. Project page: https://zhiwenfan.github.io/DragView/.
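To make the projection step concrete, the following is a minimal NumPy sketch of sampling 3D points along a target ray, projecting them onto the source view, and looking up pixel-aligned features. It is an illustration under stated assumptions, not the authors' implementation: a pinhole camera with known intrinsics `K`, a relative transform `(R, t)` standing in for the user's drag, uniform depth sampling, and nearest-neighbour feature lookup are all assumed here, and every function and parameter name is hypothetical. The view-dependent modulation layer and the attention stages are not modelled.

```python
import numpy as np

def sample_points_along_ray(origin, direction, near, far, n_samples):
    """Return (n_samples, 3) points spaced uniformly in depth along one ray."""
    direction = direction / np.linalg.norm(direction)
    depths = np.linspace(near, far, n_samples)
    return origin[None, :] + depths[:, None] * direction[None, :]

def project_to_source(points, K, R, t):
    """Project 3D points given in the target frame onto the source image.

    R (3x3) and t (3,) are an assumed relative transform from the target
    (dragged) frame to the source camera frame; K is a 3x3 pinhole matrix.
    Returns pixel coordinates (N, 2) and a mask of points in front of the camera.
    """
    cam = points @ R.T + t                    # (N, 3) in the source camera frame
    in_front = cam[:, 2] > 1e-6
    z = np.where(in_front, cam[:, 2], 1.0)    # avoid dividing by non-positive depth
    uv = (cam @ K.T)[:, :2] / z[:, None]
    return uv, in_front

def gather_pixel_aligned_features(feature_map, uv, valid):
    """Nearest-neighbour lookup in an (H, W, C) feature map; zeros where invalid."""
    h, w, c = feature_map.shape
    cols = np.rint(uv[:, 0]).astype(int)
    rows = np.rint(uv[:, 1]).astype(int)
    inside = valid & (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    feats = np.zeros((uv.shape[0], c), dtype=feature_map.dtype)
    feats[inside] = feature_map[rows[inside], cols[inside]]
    return feats, inside
```

The `inside` mask only flags samples that fall outside the source image or behind the camera. It does not detect occlusion, which the abstract attributes to a learned view-dependent modulation layer.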
