Existing 3D scene editing methods typically rely on per-scene optimization over explicit 3D representations or cascaded edit-and-reconstruct pipelines, resulting in high test-time cost, limited 3D awareness, and structural inconsistencies. To couple appearance synthesis and geometry prediction during editing, we build on a unified RGB-geometry reconstruction-generation latent space and adapt it to feed-forward 3D scene editing. The resulting framework, \textbf{JointEdit3D}, performs asymmetric latent inpainting: it observes only a single edited RGB reference latent and generates the remaining RGB views and the edited geometry latent under source-scene anchoring. JointEdit3D introduces a dedicated SceneAnchor Branch to inject source-scene structure without forcing direct copying, and adopts edit/background-aware losses to balance edited-region fidelity with unedited-content preservation. To address the lack of paired resources for standardized 3D scene editing evaluation, we introduce SceneEdit3D-15K, a dataset with 15K paired editing samples and renderer-provided 3D annotations, together with SceneEdit3D-Bench, a curated 100-sample benchmark. Experiments show that JointEdit3D improves edited-region quality and 3D structural completeness over prior baselines while maintaining competitive background preservation.
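To make the idea of an edit/background-aware loss concrete, the following is a minimal sketch and not the formulation used by JointEdit3D, which the abstract does not specify. It assumes a per-element squared error, a binary edit mask, and two hypothetical weights `w_edit` and `w_bg`. The function name `edit_background_loss` and the flat-list inputs are likewise illustrative assumptions.

```python
from typing import Sequence, Tuple


def edit_background_loss(
    pred: Sequence[float],
    target: Sequence[float],
    edit_mask: Sequence[float],
    w_edit: float = 1.0,  # hypothetical weight on the edited region
    w_bg: float = 1.0,    # hypothetical weight on the unedited background
) -> Tuple[float, float, float]:
    """Return (total, edit_term, bg_term) for a masked squared-error loss.

    edit_mask holds 1.0 where an element belongs to the edited region and
    0.0 where it belongs to the background. Each term is averaged over its
    own region, so a small edit is not swamped by a large background.
    """
    if not (len(pred) == len(target) == len(edit_mask)):
        raise ValueError("pred, target and edit_mask must have equal length")

    edit_sum = bg_sum = 0.0
    edit_count = bg_count = 0.0
    for p, t, m in zip(pred, target, edit_mask):
        sq_err = (p - t) ** 2
        edit_sum += m * sq_err
        bg_sum += (1.0 - m) * sq_err
        edit_count += m
        bg_count += 1.0 - m

    # Guard against an empty region (all-edit or all-background masks).
    edit_term = edit_sum / edit_count if edit_count > 0 else 0.0
    bg_term = bg_sum / bg_count if bg_count > 0 else 0.0
    return w_edit * edit_term + w_bg * bg_term, edit_term, bg_term


if __name__ == "__main__":
    total, edit_term, bg_term = edit_background_loss(
        pred=[0.0, 0.0, 0.0, 0.0],
        target=[1.0, 0.0, 0.0, 2.0],
        edit_mask=[1.0, 0.0, 0.0, 0.0],
        w_edit=2.0,
        w_bg=1.0,
    )
    print(total, edit_term, bg_term)
```

Normalizing each term by its own region size is one design choice among several. It keeps the two weights interpretable as a direct trade-off between edited-region fidelity and background preservation, regardless of how much of the scene the edit covers.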