MUSE: Agentic 3D Scene Authoring via Memory-Grounded Incremental Requirement Satisfaction

Text-driven 3D scene generation is a promising technique for digital content creation, embodied AI simulation, and interactive design, yet practical workflows often require refining, extending, or correcting existing scenes while preserving non-target content. Existing methods can produce realistic and structurally plausible scenes, but they generally lack editing support backed by requirement-level state tracking, so part-level failures often lead to full-scene regeneration or manual intervention. To tackle this challenge, we formulate controllable 3D scene authoring as incremental requirement satisfaction, unifying construction and editing. We present MUSE, a memory-grounded multi-agent framework in which an Architect compiles instructions into structured requirements, a Sculptor executes local scene operations, and an Inspector verifies each step while updating Working, Scene, and Skill Memory. To evaluate requirement-level controllability and preservation-aware editing, we introduce AuthorBench, which offers 145 constrained construction cases and a 1,584-case preservation-aware editing pool paired with external structured checks. On full construction cases, MUSE improves All-Goal success from 37.9 to 80.7 and surface-constraint fulfillment from 35.0 to 92.6 over the strongest baseline. On a stratified 240-case editing test split, MUSE achieves 49.6 All-Goal success, a 99.9 preservation rate, and an unintended change rate of only 0.6. Beyond automated metrics, human evaluations against the compared local-editing baselines indicate stronger alignment with user intent, and downstream navigation-proxy tests indicate stronger spatial stability. Combined with ablations validating our memory designs, these results establish MUSE as an effective framework for controllable 3D scene authoring.
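To make the incremental requirement satisfaction loop concrete, the following is a minimal sketch under stated assumptions, not the MUSE implementation. Every name in it (`Requirement`, `Memory`, `architect`, `sculptor`, `inspector`, `author`) is hypothetical, as are the toy "object on surface" instruction format and the dictionary scene representation. It only illustrates the division of labor described above: requirements are compiled once, each is satisfied by a local operation, and an operation is committed only if verification confirms that non-target content is unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    obj: str                 # object the requirement refers to
    surface: str             # surface the object must rest on
    satisfied: bool = False  # requirement-level state tracked across steps


@dataclass
class Memory:
    working: list = field(default_factory=list)  # requirements for the current instruction
    scene: dict = field(default_factory=dict)    # committed scene: object -> surface
    skill: list = field(default_factory=list)    # log of operations that passed verification


def architect(instruction: str) -> list:
    """Compile a toy 'obj on surface; obj on surface' instruction into requirements."""
    requirements = []
    for clause in instruction.split(";"):
        obj, _, surface = clause.strip().partition(" on ")
        requirements.append(Requirement(obj.strip(), surface.strip()))
    return requirements


def sculptor(scene: dict, req: Requirement) -> dict:
    """Apply one local operation on a copy of the scene, touching only req.obj."""
    candidate = dict(scene)
    candidate[req.obj] = req.surface
    return candidate


def inspector(before: dict, after: dict, req: Requirement) -> bool:
    """Verify the requirement holds and that every non-target object is preserved."""
    placed = after.get(req.obj) == req.surface
    preserved = all(after.get(k) == v for k, v in before.items() if k != req.obj)
    return placed and preserved


def author(instruction: str, memory: Memory) -> Memory:
    """Satisfy requirements one at a time, committing only verified steps."""
    memory.working = architect(instruction)
    for req in memory.working:
        candidate = sculptor(memory.scene, req)
        if inspector(memory.scene, candidate, req):
            memory.scene = candidate
            req.satisfied = True
            memory.skill.append(("place", req.obj, req.surface))
    return memory
```

In this sketch, construction and editing share one code path: calling `author` on an empty `Memory` builds a scene, while calling it on a populated one edits it, and a step that fails verification leaves the committed scene untouched instead of forcing regeneration.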
