General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model needs to be reliable over a vast space of suboptimal actions, which are often underrepresented in action-labeled robot interactions. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two independently verifiable factors: state plausibility and action reachability. We show that verifying these factors is significantly more tractable than direct forward prediction due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among proposed subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods often fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by over 22%.
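
The cycle-consistency check among proposed subgoals, inferred actions, and forward rollouts can be illustrated with a minimal toy sketch. This is not the authors' implementation: the grid-world state layout, the function names (`propose_subgoals`, `sparse_inverse_model`, `buggy_world_model`, `verify`), and the hand-coded models are all hypothetical stand-ins for learned components, chosen only to show how a mismatch between a subgoal and a forward rollout can flag a world-model error.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed toy state: (x, y, distractor). Only (x, y) is action-relevant.
State = Tuple[int, int, int]
ACTIONS: Dict[str, Tuple[int, int]] = {
    "right": (1, 0), "left": (-1, 0), "up": (0, 1), "down": (0, -1),
}


def propose_subgoals(state: State) -> List[State]:
    """Hypothetical subgoal generator: proposes plausible next states.
    Stands in for a generator learned from action-free video."""
    x, y, d = state
    return [(x + dx, y + dy, d) for dx, dy in ACTIONS.values()]


def sparse_inverse_model(state: State, subgoal: State) -> Optional[str]:
    """Hypothetical sparse inverse model: infers the action from the
    action-relevant subset of features (x, y), ignoring the distractor."""
    delta = (subgoal[0] - state[0], subgoal[1] - state[1])
    for name, move in ACTIONS.items():
        if move == delta:
            return name
    return None  # subgoal not reachable in one step


def buggy_world_model(state: State, action: str) -> State:
    """Toy forward model that is deliberately wrong for 'down', mimicking
    an action that is underrepresented in the training data."""
    x, y, d = state
    if action == "down":
        return (x, y, d)  # wrongly predicts no movement
    dx, dy = ACTIONS[action]
    return (x + dx, y + dy, d)


def verify(
    state: State, world_model: Callable[[State, str], State]
) -> List[Tuple[State, str, State]]:
    """Return (subgoal, inferred action, rollout) triples whose cycle fails."""
    failures = []
    for goal in propose_subgoals(state):
        action = sparse_inverse_model(state, goal)
        if action is None:
            continue  # action reachability fails; nothing to roll out
        predicted = world_model(state, action)
        # Compare only the action-relevant features of rollout and subgoal.
        if predicted[:2] != goal[:2]:
            failures.append((goal, action, predicted))
    return failures


failures = verify((0, 0, 7), buggy_world_model)
```

In this toy setting, `failures` contains the single triple for the `"down"` action, the only transition the forward model mispredicts. Under the assumptions above, such flagged transitions would be the candidates for further data collection or model correction.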