Text-based 2D diffusion models have demonstrated impressive capabilities in image generation and editing, and they also show substantial potential for 3D editing tasks. However, achieving consistent edits across multiple viewpoints remains challenging. While the iterative dataset update method can achieve global consistency, it suffers from slow convergence and over-smoothed textures. We propose SyncNoise, a novel geometry-guided, multi-view consistent noise editing approach for high-fidelity 3D scene editing. SyncNoise synchronously edits multiple views with 2D diffusion models while enforcing geometric consistency across the multi-view noise predictions, which ensures global consistency in both semantic structure and low-frequency appearance. To further enhance local consistency in high-frequency details, we select a group of anchor views and propagate their edits to neighboring frames through cross-view reprojection. To improve the reliability of multi-view correspondences, we introduce depth supervision during training to improve the precision of the reconstructed geometry. By enforcing geometric consistency at both the noise and pixel levels, our method achieves high-quality 3D editing results that respect the textual instructions, especially in scenes with complex textures.
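The cross-view reprojection mentioned above relies on mapping pixels from an anchor view into a neighboring view using depth and camera geometry. The following is a minimal illustrative sketch of such a depth-based warp; the function name, argument layout, and use of shared intrinsics are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def reproject(depth_a, K, T_a2b, h, w):
    """Map each pixel (u, v) of anchor view A to coordinates in view B.

    depth_a : (h, w) depth map of anchor view A
    K       : (3, 3) camera intrinsics (assumed shared by both views)
    T_a2b   : (4, 4) rigid transform from A's camera frame to B's
    Returns an (h, w, 2) array of (u, v) coordinates in view B.
    """
    # Pixel grid in homogeneous image coordinates
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)

    # Back-project to 3D points in A's camera frame using the depth map
    rays = pix @ np.linalg.inv(K).T           # (h, w, 3) viewing rays
    pts_a = rays * depth_a[..., None]         # scale rays by depth

    # Transform the 3D points into B's camera frame
    pts_h = np.concatenate([pts_a, np.ones((h, w, 1))], axis=-1)
    pts_b = pts_h @ T_a2b.T                   # (h, w, 4)

    # Project into B's image plane and dehomogenize
    proj = pts_b[..., :3] @ K.T
    uv_b = proj[..., :2] / proj[..., 2:3]
    return uv_b
```

In a pipeline like the one described, such a warp would be used both to align noise predictions across views and to copy edited anchor-view pixels into neighboring frames, with occluded or out-of-bounds correspondences masked out.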