Recent work on image content manipulation based on vision-language pre-training models has been effectively extended to text-driven 3D scene editing. However, existing 3D scene editing schemes still exhibit shortcomings that hinder further interactive design. Such schemes typically adhere to fixed input patterns, limiting users' flexibility in text input. Moreover, their editing capabilities are constrained by a single or a few 2D visual models, and integrating these models into 3D reconstruction processes requires intricate pipeline design. To address these issues, we propose a dialogue-based 3D scene editing approach, termed CE3D, centered around a large language model that accepts arbitrary textual input from users, interprets their intentions, and then autonomously invokes the corresponding visual expert models. Furthermore, we design a scheme utilizing Hash-Atlas to represent 3D scene views, which transfers the editing of 3D scenes onto 2D atlas images. This design fully decouples the 2D editing process from 3D reconstruction, enabling CE3D to flexibly integrate a wide range of existing 2D or 3D visual models without intricate fusion designs. Experimental results demonstrate that CE3D effectively integrates multiple visual models to achieve diverse editing effects, and exhibits strong scene comprehension and multi-round dialogue capabilities. Code is available at <a href="https://sk-fun.fun/CE3D">this https URL</a>.