High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures that generate complex environments in a single forward pass. Despite their strong performance in static scene perception, these models cannot yet respond to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, in which individual views are edited independently and then lifted back into 3D space. This indirect pipeline often produces blurry textures and inconsistent geometry, because 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements that deform the scene while keeping the background stable. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline, with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details and stronger multi-view consistency at near-instant inference speed.
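As a rough illustration of the residual formulation described above, the following minimal sketch adds predicted per-point displacements to a reconstructed point set and leaves background points untouched. It is not the VGGT-Edit implementation: the function name `apply_residual_edit`, the explicit per-point `edit_mask`, and the plain-list data layout are all assumptions introduced here for illustration, and the abstract does not state how background stability is actually enforced.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def apply_residual_edit(
    points: Sequence[Point],
    displacements: Sequence[Point],
    edit_mask: Sequence[float],
) -> List[Point]:
    """Return points + mask * displacement for each 3D point.

    Hypothetical sketch: `displacements` stands in for the output of a
    residual head, and `edit_mask` (1.0 = edited region, 0.0 = background)
    is an assumed mechanism for keeping the background fixed.
    """
    if not (len(points) == len(displacements) == len(edit_mask)):
        raise ValueError("points, displacements and edit_mask must have equal length")

    edited: List[Point] = []
    for point, delta, weight in zip(points, displacements, edit_mask):
        # Residual update: a zero weight reproduces the original point exactly.
        x, y, z = (p + weight * d for p, d in zip(point, delta))
        edited.append((x, y, z))
    return edited


# Usage: the first point is background (mask 0.0), the second is edited.
scene = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0)]
predicted = [(0.5, 0.0, 0.0), (0.25, 0.25, 0.0)]
mask = [0.0, 1.0]
edited_scene = apply_residual_edit(scene, predicted, mask)
```

Predicting a displacement rather than a new absolute position means that a zero output reproduces the input scene, which is one plausible reason a residual design helps keep unedited regions stable.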