MolmoAct2: Action Reasoning Models for Real-world Deployment

Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu, Weikai Huang, Xiang Fan, Wei-Chuan Tsai, Shirui Chen, Yi Ru Wang, Shanli Xing, Jaemin Cho, Jae Sung Park, Ainaz Eftekhar, Peter Sushko, Karen Farley, Angad Wadhwa, Cole Harrison, Winson Han, Ying-Chun Lee, Eli VanderBilt, Rose Hendrix, Suveen Ellawela, Lucas Ngoo, Joyce Chai, Zhongzheng Ren, Ali Farhadi, Dieter Fox, Ranjay Krishna

From arXiv, 31 pages, project page: https://allenai.org/blog/molmoact2

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low- to medium-cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
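The abstract describes MolmoThink only at a high level: depth tokens are re-predicted just for scene regions that change between timesteps. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes, purely for illustration, that a frame is a list of flattened patches, that "change" means the mean absolute pixel difference of a patch exceeds a threshold, and that there is one depth token per patch. The names `changed_patches`, `update_depth_tokens`, `predict_token`, and the default `threshold` are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Patch = Sequence[float]  # hypothetical layout: flattened pixel intensities of one patch


def changed_patches(prev: Sequence[Patch], curr: Sequence[Patch],
                    threshold: float) -> List[int]:
    """Indices of patches whose mean absolute difference exceeds `threshold`."""
    changed = []
    for i, (p, c) in enumerate(zip(prev, curr)):
        mean_abs_diff = sum(abs(a - b) for a, b in zip(p, c)) / len(c)
        if mean_abs_diff > threshold:
            changed.append(i)
    return changed


def update_depth_tokens(prev_tokens: Sequence[int],
                        prev_patches: Sequence[Patch],
                        curr_patches: Sequence[Patch],
                        predict_token: Callable[[Patch], int],
                        threshold: float = 0.05) -> Tuple[List[int], List[int]]:
    """Reuse the previous depth tokens and re-predict only the changed patches.

    `predict_token` stands in for the (expensive) depth-token predictor;
    it is an assumption of this sketch, not an API from the paper.
    """
    tokens = list(prev_tokens)  # start from the cached tokens of the last timestep
    indices = changed_patches(prev_patches, curr_patches, threshold)
    for i in indices:
        tokens[i] = predict_token(curr_patches[i])
    return tokens, indices


# Toy example: only the middle patch changes between the two frames.
prev_frame = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
curr_frame = [[0.0, 0.0], [0.9, 0.9], [1.0, 1.0]]
new_tokens, recomputed = update_depth_tokens(
    [7, 7, 7], prev_frame, curr_frame, predict_token=lambda patch: 42
)
print(new_tokens, recomputed)  # [7, 42, 7] [1]
```

In this toy setup the predictor is called once instead of three times, which is where the latency saving would come from if most of the scene is static between timesteps.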
