Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., "pick up a bowl and place it on the table"), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until they satisfy the user's intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. Code, demos, and the SAGE-10k dataset are available on the project page: https://nvlabs.github.io/sage.