Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomic skills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command->Skill->Planner->Robot->Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.
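The idea of per-environment pipeline states advancing independently above one shared physics tick can be illustrated with a minimal sketch. This is not MagicSim's API: `Stage`, `EnvState`, `ticks_per_stage`, and `run` are hypothetical names introduced only for illustration, and the assumption that each stage takes a fixed number of ticks is a simplification. Retry and annotation states are omitted.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """Pipeline stages named in the text, plus a terminal marker (assumed)."""
    COMMAND = 0
    SKILL = 1
    PLANNER = 2
    ROBOT = 3
    RECORD = 4
    DONE = 5


@dataclass
class EnvState:
    """Hypothetical per-environment bookkeeping; not MagicSim's actual types."""
    ticks_per_stage: int            # assumed: fixed number of shared ticks per stage
    stage: Stage = Stage.COMMAND
    ticks_in_stage: int = 0
    log: list = field(default_factory=list)

    def advance(self, tick: int) -> None:
        """Advance this environment's own state machine by one shared tick."""
        if self.stage is Stage.DONE:
            return
        self.ticks_in_stage += 1
        if self.ticks_in_stage >= self.ticks_per_stage:
            # Record when the stage finished, then move to the next stage.
            self.log.append((tick, self.stage.name))
            self.stage = Stage(self.stage.value + 1)
            self.ticks_in_stage = 0


def run(envs, max_ticks=100):
    """Step all environments in lockstep; return the number of ticks used."""
    for tick in range(max_ticks):
        # A single batched physics step for all environments would go here.
        for env in envs:
            env.advance(tick)
        if all(env.stage is Stage.DONE for env in envs):
            return tick + 1
    return max_ticks


if __name__ == "__main__":
    envs = [EnvState(ticks_per_stage=1), EnvState(ticks_per_stage=3)]
    used = run(envs)
    print(used, [env.log for env in envs])
```

In this sketch the two environments share every tick but finish their stages at different times, which is the property the pipeline description relies on: a slow planner in one environment does not stall command or recording progress in another.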