Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium-cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
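To make the MolmoThink idea concrete, the sketch below illustrates one way selective re-prediction could work: detect which image patches changed between two timesteps, re-predict depth tokens only for those patches, and reuse cached tokens elsewhere. This is not the released implementation. The per-patch pixel-difference test, the patch size, the threshold, and the names `changed_patches`, `update_depth_tokens`, and `predict_fn` are all assumptions introduced for illustration; the abstract does not specify how changed regions are identified.

```python
import numpy as np

def changed_patches(prev, curr, patch=16, threshold=0.05):
    """Return a (gh, gw) boolean grid marking patches that changed.

    Assumption: frames are float arrays in [0, 1] with shape (H, W) or
    (H, W, C), and "changed" means the mean absolute pixel difference
    within a patch exceeds `threshold`. Both choices are illustrative.
    """
    diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32))
    if diff.ndim == 3:
        diff = diff.mean(axis=2)          # average over colour channels
    gh, gw = diff.shape[0] // patch, diff.shape[1] // patch
    diff = diff[: gh * patch, : gw * patch]  # drop any ragged border
    per_patch = diff.reshape(gh, patch, gw, patch).mean(axis=(1, 3))
    return per_patch > threshold

def update_depth_tokens(cached_tokens, mask, predict_fn, curr):
    """Re-predict depth tokens only where `mask` is True.

    `predict_fn(frame, coords)` is a hypothetical stand-in for the model:
    it receives the current frame and a (k, 2) array of patch coordinates
    and returns k token ids. Unmasked tokens are reused from the cache.
    """
    new_tokens = cached_tokens.copy()
    coords = np.argwhere(mask)            # row-major, matches boolean indexing
    if len(coords) > 0:
        new_tokens[mask] = predict_fn(curr, coords)
    return new_tokens
```

In this sketch the cost of each step scales with the number of changed patches rather than the full token grid, which is the intuition behind the latency claim above; a real system would likely use a learned or feature-space change signal instead of raw pixel differences.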