Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages (architectural layout, then furniture placement, then small-object population), each implemented as an interaction among VLM agents: a designer, a critic, and an orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.