Post-training is essential for turning pretrained generalist robot policies into reliable task-specific controllers, but existing human-in-the-loop pipelines remain tied to physical execution: each correction requires robot time, scene setup, resets, and operator supervision in the real world. Meanwhile, action-conditioned world models have been studied mainly for imagination, synthetic data generation, and policy evaluation. We propose \textbf{Human-in-the-World-Model (Hi-WM)}, a post-training framework that uses a learned world model as a reusable corrective substrate for failure-targeted policy improvement. A policy is first rolled out in closed loop inside the world model; when the rollout becomes incorrect or failure-prone, a human intervenes directly in the model to provide short corrective actions. Hi-WM caches intermediate states and supports rollback and branching, allowing a single failure state to be reused for multiple corrective continuations and yielding dense supervision around behaviors that the base policy handles poorly. The resulting corrective trajectories are then added back to the training set for post-training. We evaluate Hi-WM on three real-world manipulation tasks spanning both rigid and deformable object interaction, and on two policy backbones. Hi-WM improves real-world success by 37.9 points on average over the base policy and by 19.0 points over a world-model closed-loop baseline, while world-model evaluation correlates strongly with real-world performance (r = 0.953). These results suggest that world models can serve not only as generators or evaluators, but also as effective corrective substrates for scalable robot post-training.
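The cache, rollback, and branching loop described in the abstract can be illustrated with a minimal Python sketch. Everything here is a hypothetical stand-in rather than the authors' implementation: `ToyWorldModel` is a one-dimensional toy in place of a learned action-conditioned world model, the "human" is a plain callable that returns a corrective action, and failure is detected by a simple predicate (in Hi-WM the human judges when a rollout becomes incorrect or failure-prone).

```python
from typing import Callable, List, Tuple

State = float
Action = float
Step = Tuple[State, Action]


class ToyWorldModel:
    """Hypothetical 1-D stand-in for a learned action-conditioned world model."""

    def step(self, state: State, action: Action) -> State:
        # Toy dynamics (an assumption): the next state is state plus action.
        return state + action


def rollout_with_cache(model: ToyWorldModel,
                       policy: Callable[[State], Action],
                       start: State,
                       horizon: int) -> List[State]:
    """Roll the policy out in closed loop inside the model, caching every state."""
    cache = [start]
    state = start
    for _ in range(horizon):
        state = model.step(state, policy(state))
        cache.append(state)
    return cache


def collect_corrections(model: ToyWorldModel,
                        cache: List[State],
                        is_failure: Callable[[State], bool],
                        correctors: List[Callable[[State], Action]],
                        correction_len: int) -> List[List[Step]]:
    """Roll back to the cached state before the first failure and branch from it."""
    fail_idx = next((i for i, s in enumerate(cache) if is_failure(s)), None)
    if fail_idx is None:
        return []  # the rollout never failed, so there is nothing to correct
    branch_state = cache[max(fail_idx - 1, 0)]  # rollback point
    trajectories = []
    for corrector in correctors:  # one branch per corrective continuation
        state = branch_state
        steps: List[Step] = []
        for _ in range(correction_len):
            action = corrector(state)
            steps.append((state, action))
            state = model.step(state, action)
        trajectories.append(steps)
    return trajectories


# Toy usage: the base policy drifts away from the goal at 0.
model = ToyWorldModel()
cache = rollout_with_cache(model, policy=lambda s: 0.5, start=0.0, horizon=5)
trajectories = collect_corrections(
    model,
    cache,
    is_failure=lambda s: abs(s) > 1.0,
    correctors=[lambda s: -0.5 * s, lambda s: -s],  # two stand-in corrections
    correction_len=2,
)
training_set: List[Step] = [step for traj in trajectories for step in traj]
```

In this toy run both corrective branches start from the same cached state, so a single failure yields several short (state, action) sequences that can be appended to the training set. A real system would replace the toy dynamics with the learned world model and the callables with teleoperated human input.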