Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA