We present an inference-time diffusion sampling method for multi-view consistent image editing using pre-trained 2D image editing models. These models can independently produce high-quality edits for each image in a set of multi-view images of a 3D scene or object, but they do not maintain consistency across views. Existing approaches typically address this by optimizing over explicit 3D representations, which entails a lengthy optimization process and is unstable under sparse-view settings. We propose an implicit 3D regularization approach that constrains the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple technique that concurrently samples two diffusion trajectories, one from a multi-view image distribution and one from a 2D edited image distribution, with a coupling term that enforces multi-view consistency among the generated images. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, demonstrating its applicability across diverse model architectures and highlighting its potential as a general solution for multi-view consistent editing.
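To make the coupling concrete, the following is a minimal sketch of what such a sampler could look like. All names (`mv_model`, `edit_model`, `step_fn`, `lam`) and the quadratic form of the coupling term are illustrative assumptions, not the paper's exact formulation; the only element taken from the abstract is the structure of running two trajectories in parallel and coupling them at each step.

```python
import torch

def coupled_diffusion_sampling(mv_model, edit_model, timesteps, step_fn,
                               shape, lam=0.1, device="cpu"):
    """Sketch of coupled diffusion sampling over two trajectories.

    mv_model(x, t)     -> noise prediction from a multi-view diffusion prior
    edit_model(x, t)   -> per-view noise prediction from a 2D editing model
    step_fn(eps, t, x) -> one reverse-diffusion update (e.g., a DDIM step)
    shape: (V, C, H, W) for V views; lam: coupling strength (assumed form).
    """
    x_mv = torch.randn(shape, device=device)  # trajectory under the multi-view prior
    x_ed = torch.randn(shape, device=device)  # trajectory under the 2D editing model
    for t in timesteps:
        eps_mv = mv_model(x_mv, t)
        eps_ed = edit_model(x_ed, t)
        # Coupling term: bias each noise prediction so the two trajectories
        # are pulled toward each other over the course of sampling; a
        # quadratic coupling is an illustrative choice, not the paper's.
        eps_mv = eps_mv + lam * (x_mv - x_ed)
        eps_ed = eps_ed + lam * (x_ed - x_mv)
        x_mv = step_fn(eps_mv, t, x_mv)
        x_ed = step_fn(eps_ed, t, x_ed)
    return x_ed  # edited views, regularized toward multi-view consistency
```

Under this reading, the coupling acts like a guidance term added to each model's noise prediction, so neither pre-trained model needs to be fine-tuned; consistency is imposed purely at inference time.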