Novel View Synthesis (NVS) and 3D generation have recently achieved significant progress. However, existing works mainly focus on confined categories or synthetic 3D assets, struggle to generalize to challenging in-the-wild scenes, and cannot be combined directly with 2D synthesis. Moreover, these methods depend heavily on camera poses, limiting their real-world applications. To overcome these issues, we propose MVInpainter, which re-formulates 3D editing as a multi-view 2D inpainting task. Specifically, MVInpainter partially inpaints multi-view images with reference guidance rather than intractably generating an entirely novel view from scratch, which largely reduces the difficulty of in-the-wild NVS and leverages unmasked clues instead of explicit pose conditions. To ensure cross-view consistency, MVInpainter is enhanced by video priors from motion components and by appearance guidance from concatenated reference key & value attention. Furthermore, MVInpainter incorporates slot attention to aggregate high-level optical flow features from unmasked regions, controlling camera movement with pose-free training and inference. Extensive scene-level experiments on both object-centric and forward-facing datasets verify the effectiveness of MVInpainter across diverse tasks, including multi-view object removal, synthesis, insertion, and replacement. The project page is https://ewrfcas.github.io/MVInpainter/.
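The "concatenated reference key & value attention" mentioned above can be illustrated with a minimal sketch: the target view's self-attention keeps its own queries, but its key/value set is extended with tokens from the reference view, letting masked regions copy appearance clues from the reference. This is a NumPy toy example under assumed shapes, not the paper's actual implementation; all names (`ref_concat_attention`, the projection matrices) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ref_concat_attention(x_tgt, x_ref, w_q, w_k, w_v):
    """Single-head attention where keys/values are concatenated from
    target-view and reference-view tokens (hypothetical sketch).
    x_tgt: (Nt, d) target-view tokens; x_ref: (Nr, d) reference tokens.
    """
    q = x_tgt @ w_q                               # queries from target only
    kv_src = np.concatenate([x_tgt, x_ref], 0)    # (Nt+Nr, d) concat K/V source
    k, v = kv_src @ w_k, kv_src @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (Nt, Nt+Nr) weights
    return attn @ v                               # (Nt, d) fused features

# Toy usage with random tokens and projections.
rng = np.random.default_rng(0)
d = 8
x_tgt, x_ref = rng.normal(size=(4, d)), rng.normal(size=(6, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out = ref_concat_attention(x_tgt, x_ref, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

Because only the key/value set is extended, the output keeps the target view's token count, so the module drops into an existing attention layer without changing downstream shapes.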