MV-WAM: Manifold-Aware World Action Model with Value Augmentation

Jintao Chen,Peidong Jia,Qingpo Wuwu,Jiaming Liu,Mengfei Du,Chun-Kai Fan,Xiaowei Chi,Hao Chen,Chengyu Bai,Zezhong Qian,Hao Wang,Jiajun Cao,Weishi Mi,Xiaozhu Ju,Jian Tang,Shanghang Zhang

from arxiv, 20 pages, 9 figures, 7 tables

Achieving robust and generalizable manipulation across diverse environments remains a fundamental challenge in embodied robotics. Recent world action models achieve strong in-domain performance, yet their gains do not extend proportionally to out-of-distribution scenarios. We attribute this to a structural mismatch between visual and action modalities, whose intrinsically heterogeneous manifolds cause joint optimization to disproportionately degrade action robustness under distribution shift. To address this, we propose MV-WAM, a novel end-to-end framework that jointly models visual prediction, action generation, and value estimation designed to effectively leverage video priors during both training and inference for enhanced action generalization. Key to this unification is a cross-modality causal mask that hierarchically grounds actions in predicted video frames and value function tokens in both modalities. To further narrow the generalization gap, MV-WAM adopts a manifold-aware optimization scheme that explicitly accounts for the structural heterogeneity across modalities. Finally, MV-WAM introduces a progress-value regulation mechanism that estimates task completion and detects misalignment between predicted frames and generated actions, enabling the policy to autonomously identify execution deviations and recover through value-guided rollback. On the RoboTwin simulation, MV-WAM achieves a 55.7% mean success rate on random scenarios without any randomized action supervision, outperforming the strongest baseline by 29.3%. MV-WAM achieves a 77.5% mean success rate across four real-world tasks of varying difficulty on a dual-arm robot. Our results demonstrate that manifold-aware cross-modal alignment is essential for robust policy generalization, offering a path toward deployable robotic manipulation.

翻译：暂无翻译

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

CVPR 2026 Findings | 算力砍半、性能不降！全开源 A₁模型：让机器人大模型真正走向落地

专知会员服务

15+阅读 · 4月12日

【博士论文】重新审视机器人安全性：面向真实世界自主运行的自适应与可扩展方法

专知会员服务

12+阅读 · 2月25日

《人机协作中的自适应任务规划与动态角色分配》最新30页报告

专知会员服务

27+阅读 · 2025年11月21日

《面向大语言模型引导规划、赌徒驱动探索与多智能体导航的分层决策》最新180页

专知会员服务

27+阅读 · 2025年11月17日