Generating simulation-ready tabletop scenes from task instructions is a promising research direction in Embodied AI. However, existing task-to-scene generation methods rely exclusively on large language models (LLMs) to predict scene layouts, which inevitably yields colliding or floating objects because of LLMs' inherent limitations in 3D spatial reasoning. In this paper, we present STABLE, a semantics-physics dual system tailored for simulation-ready tabletop scene generation. STABLE consists of two complementary modules: (i) a Semantic Reasoner, an LLM fine-tuned on a structured tabletop scene dataset to generate coarse layouts from input task instructions, and (ii) a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates to refine layouts, ensuring the physical plausibility of scenes while preserving semantic alignment with task instructions. STABLE adopts a progressive generation paradigm: by alternating between the Semantic Reasoner and the Physics Corrector, it incrementally expands the scene from task-critical objects to background objects. Experiments demonstrate that STABLE generates simulation-ready tabletop scenes that strictly conform to task instructions and significantly improves the physical validity of scenes over prior art.
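The alternating, progressive paradigm described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the STABLE implementation: the `Obj` type, the hard-coded proposals standing in for the fine-tuned LLM, and in particular the `physics_corrector` stub, which replaces the flow-based denoising model with a simple snap-to-table and push-apart rule on square footprints. The sketch shows only the control flow: each stage proposes coarse poses, corrects them against the objects already placed, and then moves from task-critical objects to background objects.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Obj:
    name: str
    x: float     # footprint center on the table, x (m)
    y: float     # footprint center on the table, y (m)
    z: float     # height of the object's base above the table (m)
    half: float  # half-width of a square footprint (m)

def overlaps(a, b):
    # Square footprints intersect when the centers are closer than the
    # summed half-widths on both axes.
    reach = a.half + b.half
    return abs(a.x - b.x) < reach and abs(a.y - b.y) < reach

def semantic_reasoner(task, scene, stage):
    # Hypothetical stand-in for the fine-tuned LLM: returns coarse poses
    # that deliberately contain collisions and floating objects.
    proposals = {
        "task": [Obj("mug", 0.00, 0.0, 0.03, 0.05),
                 Obj("plate", 0.04, 0.0, 0.00, 0.10)],
        "background": [Obj("vase", 0.05, 0.0, 0.10, 0.05)],
    }
    return proposals[stage]

def physics_corrector(fixed, new, max_iters=100):
    # Toy stand-in for the flow-based denoiser (assumption): drop each new
    # object onto the table, then push it along +x until it is collision-free.
    placed = list(fixed)
    for obj in new:
        obj = replace(obj, z=0.0)
        for _ in range(max_iters):
            hit = next((p for p in placed if overlaps(obj, p)), None)
            if hit is None:
                break
            obj = replace(obj, x=hit.x + hit.half + obj.half + 1e-3)
        else:
            raise RuntimeError(f"could not place {obj.name}")
        placed.append(obj)
    return placed

def generate_scene(task):
    # Alternate the two modules, expanding from task-critical objects
    # to background objects.
    scene = []
    for stage in ("task", "background"):
        coarse = semantic_reasoner(task, scene, stage)
        scene = physics_corrector(scene, coarse)
    return scene

if __name__ == "__main__":
    for obj in generate_scene("serve coffee"):
        print(obj)
```

In this toy version, objects placed in an earlier stage are never moved by a later one, so task-critical poses stay fixed while background objects are fitted around them. Whether STABLE freezes earlier objects in the same way is not stated in the abstract.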