Text-driven 3D editing seeks to modify 3D scenes according to textual descriptions, and most existing approaches tackle this by adapting pre-trained 2D image editors to multi-view inputs. However, without explicit control over multi-view information exchange, they often fail to maintain cross-view consistency, leading to insufficient edits and blurry details. We introduce CoreEditor, a novel framework for consistent text-to-3D editing. The key innovation is a correspondence-constrained attention mechanism that enforces precise interactions between pixels expected to remain consistent throughout the diffusion denoising process. Beyond relying solely on geometric alignment, we further incorporate semantic similarity estimated during denoising, enabling more reliable correspondence modeling and robust multi-view editing. In addition, we design a selective editing pipeline that allows users to choose preferred results from multiple candidates, offering greater flexibility and user control. Extensive experiments show that CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.
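To make the core idea concrete, here is a minimal sketch of correspondence-constrained attention between two views, assuming a precomputed geometric correspondence mask refined by a cosine-similarity gate; the function name, the threshold `sem_thresh`, and the mask construction are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def masked_cross_view_attention(q, k, v, geo_mask, sem_thresh=0.5):
    """Attention restricted to pixel pairs deemed corresponding.

    geo_mask[i, j] is True when pixel i in view A geometrically maps to
    pixel j in view B. A semantic gate (cosine similarity of the query
    and key features) further drops pairs below `sem_thresh`, mirroring
    the idea of combining geometric alignment with semantic similarity.
    """
    d = q.shape[-1]
    # Cosine similarity between L2-normalized query/key features.
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    sem = qn @ kn.T
    # Keep only pairs that are both geometrically and semantically matched.
    mask = geo_mask & (sem > sem_thresh)
    scores = (q @ k.T) / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)  # block non-corresponding pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With the semantic gate disabled (`sem_thresh=-1.0`), a row whose mask allows exactly one partner attends only to that partner's value vector, which is the constrained-interaction behavior the mechanism is meant to enforce.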