Text-driven indoor scene generation and editing require an intermediate representation that language models can both produce and revise. Existing LLM-based systems often rely on scene graphs or global constraint lists, which are compact but underspecify local geometry and make instruction-based edits difficult to localize. We frame this problem as structured program generation and local program repair, and propose Hierarchical Descriptive Scene Language (HDSL), an XML/CSS-style domain-specific language for structured 3D indoor scenes. HDSL represents rooms, regions, objects, and support surfaces as a tree with local coordinates, making complex scenes easier to plan recursively and easier to retrieve for editing. Our pipeline uses LLM agents to generate HDSL subtrees with bounded verification, grounds non-virtual nodes through multimodal asset retrieval, and applies force-directed layout optimization to repair boundary and collision errors. For editing, Hierarchical Retrieval-Augmented Generation (HRAG) retrieves the relevant subtree, asks the LLM to rewrite only that local context, and merges the result back through a deterministic three-way merge. In our reproduced benchmark, HDSL improves average object coverage, text-scene alignment, and generation time over full text-to-scene baselines while remaining competitive with recent layout-only reproductions on geometry metrics; for editing, HRAG reduces token use by $5.22\times$ and runtime by $6.19\times$, produces valid DSL for all eight paired edits, and better preserves unrelated scene objects.
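To make the tree-with-local-coordinates idea concrete, the following is a minimal Python sketch, not the HDSL implementation. The tag names (`room`, `region`, `object`), the `id`, `x`, and `y` attributes, and the helper functions are all hypothetical and chosen only for illustration. A plain subtree replacement stands in for the deterministic three-way merge described above, which is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical HDSL-like markup: tag and attribute names are assumptions,
# not the paper's actual syntax. Each node stores an offset relative to
# its parent (local coordinates).
SCENE = """
<room id="bedroom" x="0" y="0">
  <region id="sleep" x="1.0" y="2.0">
    <object id="bed" x="0.5" y="0.0">
      <object id="pillow" x="0.2" y="0.1"/>
    </object>
  </region>
  <region id="work" x="4.0" y="0.5">
    <object id="desk" x="0.0" y="0.0"/>
  </region>
</room>
"""


def find_path(root, node_id):
    """Return the nodes from root down to the node with this id, or []."""
    if root.get("id") == node_id:
        return [root]
    for child in root:
        path = find_path(child, node_id)
        if path:
            return [root] + path
    return []


def world_position(root, node_id):
    """Sum the local offsets along the path to get a world-space position."""
    path = find_path(root, node_id)
    if not path:
        raise KeyError(node_id)
    x = sum(float(node.get("x", "0")) for node in path)
    y = sum(float(node.get("y", "0")) for node in path)
    return (x, y)


def replace_subtree(root, node_id, new_xml):
    """Swap one subtree for an edited version, leaving its siblings untouched.

    This is a simple stand-in for a merge step, not a three-way merge.
    """
    path = find_path(root, node_id)
    if len(path) < 2:
        raise ValueError("node not found, or node is the root")
    parent, old = path[-2], path[-1]
    index = list(parent).index(old)
    parent.remove(old)
    parent.insert(index, ET.fromstring(new_xml))


if __name__ == "__main__":
    scene = ET.fromstring(SCENE)
    print(world_position(scene, "pillow"))
    # A local edit: only the "sleep" region is rewritten.
    replace_subtree(
        scene,
        "sleep",
        '<region id="sleep" x="1.0" y="2.0">'
        '<object id="bed" x="1.5" y="0.0"/></region>',
    )
    print(world_position(scene, "desk"))
```

Because every node stores only an offset from its parent, rewriting one region cannot change the positions of objects in other regions. That locality is what allows an edit to be confined to a retrieved subtree.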