Text-driven 3D editing seeks to modify 3D scenes according to textual descriptions, and most existing approaches tackle this by adapting pre-trained 2D image editors to multi-view inputs. However, without explicit control over multi-view information exchange, they often fail to maintain cross-view consistency, leading to insufficient edits and blurry details. We introduce CoreEditor, a novel framework for consistent text-to-3D editing. The key innovation is a correspondence-constrained attention mechanism that enforces precise interactions between pixels expected to remain consistent throughout the diffusion denoising process. Beyond relying solely on geometric alignment, we further incorporate semantic similarity estimated during denoising, enabling more reliable correspondence modeling and robust multi-view editing. In addition, we design a selective editing pipeline that allows users to choose preferred results from multiple candidates, offering greater flexibility and user control. Extensive experiments show that CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.
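The correspondence-constrained attention described above can be illustrated with a minimal sketch: attention scores between pixels of different views are masked so that only pairs linked by geometric correspondence, or pairs whose semantic similarity (estimated during denoising) exceeds a threshold, may interact. All names, shapes, and the thresholding scheme here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def correspondence_masked_attention(q, k, v, geo_mask, sem_sim, sem_thresh=0.5):
    """Hypothetical sketch of correspondence-constrained attention.

    q, k, v   : (N, d) query/key/value features for N pixels across views
    geo_mask  : (N, N) bool, True where pixels are geometric correspondences
    sem_sim   : (N, N) float, semantic similarity estimated during denoising
    Pairs are allowed to attend only if geometrically matched OR
    semantically similar above `sem_thresh` (assumed gating rule).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Restrict interaction to pixels expected to remain consistent.
    allowed = geo_mask | (sem_sim >= sem_thresh)
    scores = np.where(allowed, scores, -1e9)  # mask disallowed pairs
    # Numerically stable softmax over the allowed pairs.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With a strict mask (each pixel matched only to itself), the output reduces to the pixel's own value, showing that information exchange is confined to the designated correspondences.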