Given the steep learning curve of professional 3D software and the time-consuming process of managing large 3D assets, language-guided 3D scene editing has significant potential in fields such as virtual reality, augmented reality, and gaming. However, recent approaches to language-guided 3D scene editing either require manual intervention or focus only on appearance modifications without supporting comprehensive scene layout changes. In response, we propose EditRoom, a unified framework capable of executing a variety of layout edits through natural language commands, without requiring manual intervention. Specifically, EditRoom leverages Large Language Models (LLMs) for command planning and generates target scenes using a diffusion-based method, enabling six types of edits: rotate, translate, scale, replace, add, and remove. To address the lack of data for language-guided 3D scene editing, we develop an automatic pipeline to augment existing 3D scene synthesis datasets and introduce EditRoom-DB, a large-scale dataset with 83k editing pairs, for training and evaluation. Our experiments demonstrate that our approach consistently outperforms all baselines across all metrics, indicating higher accuracy and coherence in language-guided scene layout editing.