Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current architectures maps language instructions and visual observations to actions in a single forward pass. While conceptually simple, this formulation entangles instruction comprehension, spatial scene understanding, and motor control within a single learning objective. As a result, the action expert must implicitly relearn cognitive and perceptual capabilities already present in the pretrained VLM, which can limit both learning efficiency and generalization. We introduce AVP (Action with Visual Primitives), an end-to-end architecture built around a visual-primitive-centric interface between the VLM and the action expert: the VLM infers the next-stage target and emits visual-primitive tokens that condition a flow-matching action expert, with supervision derived from end-effector kinematics. Real-robot experiments on general pick-and-place tasks show that AVP improves the success rate by 37.04% over pi_0.5 and outperforms other recent methods, with consistent gains in data efficiency, spatial-compositional generalization, and object-level transfer.
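
As a rough illustration of what conditioning a flow-matching action expert can look like, the sketch below shows a generic conditional flow-matching loss and an Euler sampler. It is not AVP's implementation. The function names, the straight-line path from noise to action, the NumPy toy setting, and the use of a single vector `cond` as a stand-in for the VLM's visual-primitive tokens are all assumptions made for illustration.

```python
import numpy as np


def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to the action (t=1)."""
    return (1.0 - t) * noise + t * action


def flow_matching_loss(velocity_fn, action, cond, rng):
    """Single-sample conditional flow-matching loss (assumed convention)."""
    noise = rng.standard_normal(action.shape)
    t = rng.uniform(0.0, 1.0)
    a_t = interpolate(noise, action, t)
    target_velocity = action - noise  # time derivative of the straight path
    pred_velocity = velocity_fn(a_t, t, cond)
    return float(np.mean((pred_velocity - target_velocity) ** 2))


def euler_sample(velocity_fn, cond, action_dim, rng, num_steps=10):
    """Integrate the velocity field from Gaussian noise (t=0) towards t=1."""
    a = rng.standard_normal(action_dim)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        a = a + dt * velocity_fn(a, k * dt, cond)
    return a


def oracle_velocity(a_t, t, cond):
    """Hypothetical toy field: treats `cond` itself as the target action.

    For a deterministic target, the velocity of the straight path at a_t
    is (target - a_t) / (1 - t). A learned network would replace this.
    """
    return (cond - a_t) / (1.0 - t)
```

In this toy setting the conditioning vector is itself the target action, so the oracle field lands on it at the final Euler step. A learned action expert would replace `oracle_velocity` with a network that takes the primitive tokens as input.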