Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Haoqi Yuan, Zhixuan Liang, Anzhe Chen, Ye Wang, Haoyang Li, Pei Lin, Yiyang Huang, Zixing Lei, Tong Zhang, Jiazhao Zhang, Jie Zhang, Jingyang Fan, Gengze Zhou, Qihang Peng, Chenxu Lv, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Chenfei Wu, Xiong-Hui Chen

From arXiv, 44 pages

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, which makes alignment and scale difficult to achieve simultaneously. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos, without proprietary data collection, we construct a ~38,100-hour pretraining corpus, and the resulting model exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality, so we instead evaluate in out-of-distribution (OOD) settings, including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including $\pi_{0.5}$, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.
