Generating simulation-ready tabletop scenes from task instructions is a promising research direction in Embodied AI. However, existing task-to-scene generation methods rely exclusively on large language models (LLMs) to predict scene layouts, inevitably yielding colliding or floating objects because of LLMs' inherent limitations in 3D spatial reasoning. In this paper, we present STABLE, a semantics-physics dual system tailored for simulation-ready tabletop scene generation. STABLE consists of two complementary modules: (i) a Semantic Reasoner, an LLM fine-tuned on a structured tabletop scene dataset to generate coarse layouts from input task instructions, and (ii) a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates to refine layouts, ensuring the physical plausibility of scenes while preserving semantic alignment with task instructions. STABLE adopts a progressive generation paradigm: by alternating between the Semantic Reasoner and the Physics Corrector, it incrementally expands the scene from task-critical objects to background objects. Experiments demonstrate that STABLE generates simulation-ready tabletop scenes that strictly conform to task instructions and significantly improves the physical validity of scenes over prior art.
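The alternation between the two modules described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, and every name and parameter in it is hypothetical: `propose_layout` stands in for the fine-tuned LLM and simply returns crude poses, while `correct_physics` replaces the flow-based denoising model with a plain geometric relaxation that approximates objects as discs, snaps them onto the table, and pushes overlapping pairs apart. It also assumes that all objects placed so far are corrected jointly at each stage, which the abstract does not specify.

```python
import math
from dataclasses import dataclass

TABLE_Z = 0.0  # assumed height of the table surface


@dataclass
class Obj:
    name: str
    x: float
    y: float
    z: float
    r: float  # footprint radius; objects are approximated as discs


def propose_layout(names, placed):
    """Hypothetical stand-in for the Semantic Reasoner.

    Returns coarse poses that are deliberately invalid: new objects sit
    too close together and float slightly above the table.
    """
    start = len(placed)
    return [Obj(n, 0.05 * (start + i), 0.0, 0.02, 0.05)
            for i, n in enumerate(names)]


def penetration(a, b):
    """Overlap depth of two disc footprints (positive means collision)."""
    return a.r + b.r - math.hypot(b.x - a.x, b.y - a.y)


def correct_physics(objs, steps=200, margin=1e-3):
    """Hypothetical stand-in for the Physics Corrector.

    Applies iterative pose updates: removes floating, then repeatedly
    pushes colliding pairs apart until no collision remains.
    """
    for o in objs:
        o.z = TABLE_Z  # rest every object on the table
    for _ in range(steps):
        moved = False
        for i in range(len(objs)):
            for j in range(i + 1, len(objs)):
                a, b = objs[i], objs[j]
                if penetration(a, b) <= 0:
                    continue
                dx, dy = b.x - a.x, b.y - a.y
                d = math.hypot(dx, dy)
                # Fall back to a fixed direction if centers coincide.
                ux, uy = (dx / d, dy / d) if d > 0 else (1.0, 0.0)
                push = 0.5 * (a.r + b.r + margin - d)
                a.x -= ux * push
                a.y -= uy * push
                b.x += ux * push
                b.y += uy * push
                moved = True
        if not moved:
            break
    return objs


def generate_scene(task_objects, background_objects):
    """Progressive generation: task-critical objects first, then background."""
    scene = []
    for stage in (task_objects, background_objects):
        scene.extend(propose_layout(stage, scene))  # semantic step
        correct_physics(scene)                      # physics step
    return scene


def has_collision(objs):
    return any(penetration(a, b) > 0
               for i, a in enumerate(objs) for b in objs[i + 1:])
```

Calling `generate_scene(["mug", "kettle"], ["plate", "napkin"])` places the two task-critical objects first, corrects them, and only then adds and corrects the background objects, mirroring the staged expansion the abstract describes.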