Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

Post-training is essential for turning pretrained generalist robot policies into reliable task-specific controllers, but existing human-in-the-loop pipelines remain tied to physical execution: each correction requires robot time, scene setup, resets, and operator supervision in the real world. Meanwhile, action-conditioned world models have been studied mainly for imagination, synthetic data generation, and policy evaluation. We propose \textbf{Human-in-the-World-Model (Hi-WM)}, a post-training framework that uses a learned world model as a reusable corrective substrate for failure-targeted policy improvement. A policy is first rolled out in closed loop inside the world model; when the rollout becomes incorrect or failure-prone, a human intervenes directly in the model to provide short corrective actions. Hi-WM caches intermediate states and supports rollback and branching, allowing a single failure state to be reused for multiple corrective continuations and yielding dense supervision around behaviors that the base policy handles poorly. The resulting corrective trajectories are then added back to the training set for post-training. We evaluate Hi-WM on three real-world manipulation tasks spanning both rigid and deformable object interaction, and on two policy backbones. Hi-WM improves real-world success by 37.9 points on average over the base policy and by 19.0 points over a world-model closed-loop baseline, while world-model evaluation correlates strongly with real-world performance (r = 0.953). These results suggest that world models can serve not only as generators or evaluators, but also as effective corrective substrates for scalable robot post-training.
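The rollback-and-branching loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the one-dimensional integer state, the additive `toy_world_model`, the always-drifting `base_policy`, and the hand-written correction lists are all hypothetical stand-ins chosen so the example runs on its own. It only shows how one cached failure state can seed several corrective continuations whose (state, action) pairs are collected for post-training.

```python
from typing import Callable, List, Tuple

State = int
Action = int
Pair = Tuple[State, Action]


def toy_world_model(state: State, action: Action) -> State:
    """Hypothetical stand-in for a learned action-conditioned world model."""
    return state + action


def base_policy(state: State) -> Action:
    """Hypothetical base policy that always drifts away from the goal state 0."""
    return 1


def rollout(policy: Callable[[State], Action],
            model: Callable[[State, Action], State],
            start: State, horizon: int) -> Tuple[List[State], List[Action]]:
    """Closed-loop rollout inside the model, caching every intermediate state."""
    states, actions = [start], []
    for _ in range(horizon):
        action = policy(states[-1])
        actions.append(action)
        states.append(model(states[-1], action))
    return states, actions


def branch(model: Callable[[State, Action], State],
           cached_state: State,
           corrective_actions: List[Action]) -> Tuple[List[Pair], State]:
    """Roll back to a cached state and apply a short corrective action sequence."""
    pairs, state = [], cached_state
    for action in corrective_actions:
        pairs.append((state, action))
        state = model(state, action)
    return pairs, state


# Roll the base policy out in the model and cache the visited states.
states, _ = rollout(base_policy, toy_world_model, start=0, horizon=3)

# Assume a human flags step 2 as failure-prone and supplies two alternative
# corrections from that same cached state (values are illustrative only).
failure_state = states[2]
corrections = [[-1, -1], [-2]]

dataset: List[Pair] = []
finals: List[State] = []
for corrective_actions in corrections:
    pairs, final_state = branch(toy_world_model, failure_state, corrective_actions)
    dataset.extend(pairs)  # corrective data to add back to the training set
    finals.append(final_state)
```

In this toy setting both branches start from the same cached state, so no part of the rollout before the failure has to be re-executed to collect a second correction.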
