MagicSim: A Unified Infrastructure for Executable Embodied Interaction

Haoran Lu, Songling Liu, Yue Chen, Guo Ye, Mutian Shen, Shuyang Yu, Yu Xiao, Jihai Zhao, Shang Wu, Jianshu Zhang, Xiangtian Gui, Chuye Hong, Yuran Wang, Maojiang Su, Jiayi Wang, Ruihai Wu, Zhaoran Wang, Han Liu

Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomic skills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command->Skill->Planner->Robot->Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.
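The idea of per-environment states advancing independently above one shared tick can be illustrated with a toy scheduler. This is only a minimal sketch under stated assumptions, not MagicSim's implementation: the names `Stage`, `EnvState`, `step_all`, and `run` are hypothetical, and the per-stage delays stand in for real planning or execution time.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    COMMAND = auto()
    SKILL = auto()
    PLANNER = auto()
    ROBOT = auto()
    RECORD = auto()
    DONE = auto()


# Hypothetical stage order mirroring Command->Skill->Planner->Robot->Record.
ORDER = [Stage.COMMAND, Stage.SKILL, Stage.PLANNER,
         Stage.ROBOT, Stage.RECORD, Stage.DONE]


@dataclass
class EnvState:
    """Per-environment bookkeeping (illustrative, not MagicSim's data model)."""
    stage: Stage = Stage.COMMAND
    wait: int = 0                      # ticks left before the stage completes
    log: list = field(default_factory=list)


def step_all(envs, tick, durations):
    """Advance every environment by one shared tick."""
    for env, delays in zip(envs, durations):
        if env.stage is Stage.DONE:
            continue                   # finished environments simply idle
        if env.wait > 0:
            env.wait -= 1              # e.g. an asynchronous plan still running
            continue
        env.log.append((tick, env.stage.name))   # stage completed at this tick
        env.stage = ORDER[ORDER.index(env.stage) + 1]
        env.wait = delays.get(env.stage, 0)      # assumed delay of the next stage


def run(durations, max_ticks=50):
    """Run one environment per entry in `durations` until all are done."""
    envs = [EnvState() for _ in durations]
    for tick in range(max_ticks):
        if all(e.stage is Stage.DONE for e in envs):
            break
        step_all(envs, tick, durations)
    return envs


# Environment 1 is given a slower planner (3 extra ticks, an assumed value).
envs = run([{}, {Stage.PLANNER: 3}])
```

In this sketch both environments share the same tick counter, yet the second one completes its pipeline three ticks later because its planning stage takes longer; neither blocks the other.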
