Vision-language models (VLMs) are powerful general-purpose reasoners, yet converting them into robot control policies, i.e., vision-language-action models (VLAs), is surprisingly difficult. The root cause is a twofold gap: VLMs are trained on internet-scale images with language-understanding objectives, while VLAs must perceive robot scenes and predict motor actions. Fine-tuning a VLM directly on robot action data forces the model to cross both gaps at once: the learning curve is steep, and the rich generalizations learned during pretraining tend to degrade rather than transfer. We argue that these gaps can be bridged gradually with the right intermediate data. We introduce \emph{embodied trajectory-coupled (ETC) data}, vision-language supervision derived from the same robot scenes and trajectories used for action learning. Because ETC data shares the visual context of robot operation while retaining familiar language-understanding objectives, it provides a natural stepping stone between VLM pretraining and VLA fine-tuning. Building on this, we design a three-stage training recipe. Distribution Bridging first adapts the VLM to embodied vision-language semantics. Objective Bridging then gradually shifts the model toward action prediction while preserving the acquired representations. Retentive Adaptation finally specializes the policy to the target deployment domain. We further show that mixing task-relevant out-of-distribution ETC data with a small amount of action data enables the model to generalize to novel vision-language conditions without requiring additional robot demonstrations. Simulation and real-robot experiments confirm that this gradual bridging strategy is key to transferring VLM generalization into robust, deployable robot policies.