While significant research has focused on developing embodied reasoning capabilities with Vision-Language Models (VLMs) or on integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing Vlaser, a Vision-Language-Action model with synergistic embodied reasoning capability: a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks, including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodiment-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.