Recent advances in 3D representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have greatly improved realistic scene modeling and novel-view synthesis. However, achieving controllable and consistent editing in dynamic 3D scenes remains a significant challenge. Previous work is largely constrained by its editing backbone, resulting in inconsistent edits and limited controllability. We introduce a novel framework that first fine-tunes the InstructPix2Pix model and then performs a two-stage optimization of the scene based on deformable 3D Gaussians. The fine-tuning enables the model to "learn" the editing ability from a single edited reference image, reducing the complex task of dynamic scene editing to a simple 2D image-editing problem. By learning editing regions and styles directly from the reference, our approach achieves consistent and precise local edits without tracking the desired editing regions, effectively addressing the key challenges of dynamic scene editing. The two-stage optimization then progressively edits the trained dynamic scene, using a designed edited-image buffer to accelerate convergence and improve temporal consistency. Compared with state-of-the-art methods, our approach offers more flexible and controllable local scene editing, achieving high-quality and consistent results.
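The abstract does not specify how the edited-image buffer is implemented; as a purely illustrative sketch, it could be a fixed-capacity cache of edited 2D frames that the deformable-Gaussian optimizer samples from across iterations, so that the same edited targets are reused (aiding temporal consistency) instead of re-running the 2D editor every step. All names and the `capacity`/`sample` parameters below are assumptions, not the paper's actual design.

```python
import random
from collections import deque


class EditedImageBuffer:
    """Hypothetical edited-image buffer (an assumption, not the paper's
    actual implementation): caches recently edited frames keyed by
    timestamp so the scene optimizer can reuse them as supervision
    targets across iterations."""

    def __init__(self, capacity=32):
        # deque with maxlen evicts the oldest edited frame automatically
        self.buffer = deque(maxlen=capacity)

    def push(self, timestamp, edited_image):
        # Store one edited 2D frame; 'edited_image' is any opaque object
        # (e.g. an image tensor produced by the fine-tuned 2D editor).
        self.buffer.append((timestamp, edited_image))

    def sample(self, k=4):
        # Draw up to k cached (timestamp, image) pairs to supervise the
        # deformable-Gaussian optimization at the current iteration.
        k = min(k, len(self.buffer))
        return random.sample(list(self.buffer), k)


buf = EditedImageBuffer(capacity=2)
buf.push(0, "frame0_edited")
buf.push(1, "frame1_edited")
buf.push(2, "frame2_edited")  # evicts the entry for timestamp 0
batch = buf.sample(5)          # capped at the 2 frames currently cached
```

The fixed capacity is one plausible way to bound memory while keeping a rolling window of recent edits; the actual buffer policy in the paper may differ.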