Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods remain inefficient and lack high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm. First, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples covering spatial grounding, affordance, and embodied reasoning tasks. Second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, where PokeVLA outperforms comparable baselines in both success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for constructing the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA