Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase, which extracts universal 3D spatial priors in a unified camera-centric space, and a post-training phase, which performs efficient embodiment alignment within a robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline: it first establishes fundamental spatial grounding via poses, then performs motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate, as well as competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.
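To make the notion of discrete pose tokens concrete, the following is a minimal sketch, not the authors' released code, of how a continuous camera-centric pose could be binned into a small set of integer tokens; the vocabulary size, workspace bounds, and Euler-angle parameterization are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of pose tokenization (illustrative assumptions, not the paper's implementation).
import numpy as np

N_BINS = 256                  # assumed token vocabulary size per pose dimension
POS_RANGE = (-1.0, 1.0)       # assumed camera-centric workspace bounds, in meters
ROT_RANGE = (-np.pi, np.pi)   # Euler angles in radians

def discretize(value, lo, hi, n_bins=N_BINS):
    """Map a continuous value in [lo, hi] to an integer bin index in [0, n_bins - 1]."""
    value = float(np.clip(value, lo, hi))
    return int(round((value - lo) / (hi - lo) * (n_bins - 1)))

def pose_to_tokens(position_xyz, euler_rpy):
    """Convert a 6-DoF pose (3 translations + 3 rotations) into 6 discrete pose tokens."""
    tokens = [discretize(v, *POS_RANGE) for v in position_xyz]
    tokens += [discretize(v, *ROT_RANGE) for v in euler_rpy]
    return tokens

# Example: a pose 0.3 m to the right of and 0.5 m in front of the camera with a small yaw.
print(pose_to_tokens([0.3, -0.1, 0.5], [0.0, 0.0, 0.2]))
```

Under this kind of scheme, poses from 3D grounding datasets and waypoints from robot trajectories share one token space, which is what allows the two data sources to be mixed during pre-training.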