Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

Post-training is essential for turning pretrained generalist robot policies into reliable task-specific controllers, but existing human-in-the-loop pipelines remain tied to physical execution: each correction requires robot time, scene setup, resets, and operator supervision in the real world. Meanwhile, action-conditioned world models have been studied mainly for imagination, synthetic data generation, and policy evaluation. We propose Human-in-the-World-Model (Hi-WM), a post-training framework that uses a learned world model as a reusable corrective substrate for failure-targeted policy improvement. A policy is first rolled out in closed loop inside the world model; when the rollout becomes incorrect or failure-prone, a human intervenes directly in the model to provide short corrective actions. Hi-WM caches intermediate states and supports rollback and branching, allowing a single failure state to be reused for multiple corrective continuations and yielding dense supervision around behaviors that the base policy handles poorly. The resulting corrective trajectories are then added back to the training set for post-training. We evaluate Hi-WM on three real-world manipulation tasks spanning both rigid and deformable object interaction, and on two policy backbones. Hi-WM improves real-world success by 37.9 points on average over the base policy and by 19.0 points over a world-model closed-loop baseline, while world-model evaluation correlates strongly with real-world performance (r = 0.953). These results suggest that world models can serve not only as generators or evaluators, but also as effective corrective substrates for scalable robot post-training.
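The caching, rollback, and branching loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `ToyWorldModel`, `CorrectionSession`, the additive dynamics, and the tuple-based states and actions are all hypothetical stand-ins chosen so the example runs without a learned model or a human operator. It only shows how one cached failure state might be reused for several corrective continuations.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]
Step = Tuple[State, Action]


class ToyWorldModel:
    """Hypothetical stand-in for a learned action-conditioned world model."""

    def step(self, state: State, action: Action) -> State:
        # Assumed toy dynamics: the next state is the state plus the action.
        return tuple(s + a for s, a in zip(state, action))


class CorrectionSession:
    """Caches rollout states so a human can roll back and branch (assumed design)."""

    def __init__(self, model: ToyWorldModel) -> None:
        self.model = model
        self.cache: Dict[int, State] = {}

    def rollout(self, policy: Callable[[State], Action],
                start: State, horizon: int) -> List[Step]:
        # Closed-loop policy rollout inside the model, caching every state.
        state = start
        self.cache = {0: start}
        trajectory: List[Step] = []
        for t in range(horizon):
            action = policy(state)
            trajectory.append((state, action))
            state = self.model.step(state, action)
            self.cache[t + 1] = state
        return trajectory

    def branch(self, t: int, corrective_actions: List[Action]) -> List[Step]:
        # Roll back to the cached state at step t and apply human corrections.
        state = self.cache[t]
        trajectory: List[Step] = []
        for action in corrective_actions:
            trajectory.append((state, action))
            state = self.model.step(state, action)
        return trajectory


def drifting_policy(state: State) -> Action:
    # Hypothetical base policy that drifts upward in the second coordinate.
    return (1.0, 0.5)


session = CorrectionSession(ToyWorldModel())
base = session.rollout(drifting_policy, (0.0, 0.0), horizon=4)

# Two different corrective continuations from the same cached state (step 2).
branch_a = session.branch(2, [(1.0, -0.5), (1.0, -0.5)])
branch_b = session.branch(2, [(0.5, -1.0)])

# Corrective (state, action) pairs that would be added to the training set.
corrections = branch_a + branch_b
```

In this sketch both branches start from the identical cached state, which mirrors the abstract's point that a single failure state can yield multiple corrective trajectories without re-running the rollout that reached it.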

