Vision-Language-Action (VLA) models are designed to ground instructions in visual context and generate action sequences for robotic manipulation. Despite recent progress, VLA models still face challenges in learning relevant and reusable primitives, reducing reliance on large-scale data and complex architectures, and enabling exploration beyond demonstrations. To address these challenges, we propose a novel Neuro-Symbolic Vision-Language-Action (NS-VLA) framework trained via online reinforcement learning (RL). It introduces a symbolic encoder to embed vision and language features and extract structured primitives, employs a symbolic solver for data-efficient action sequencing, and leverages online RL to optimize action generation through broader exploration. Experiments on robotic manipulation benchmarks demonstrate that NS-VLA outperforms previous methods in both one-shot training and data-perturbed settings, while exhibiting superior zero-shot generalization, high data efficiency, and an expanded exploration space. Our code is available.
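To make the three-stage design concrete, below is a minimal sketch of how the described pipeline might be composed: a symbolic encoder extracts primitives from vision and language inputs, a symbolic solver sequences them, and an online RL policy refines the low-level actions. All class and method names here (`SymbolicEncoder`, `SymbolicSolver`, `policy.act`, `env.step`, etc.) are hypothetical illustrations under assumed interfaces, not the authors' actual API.

```python
# Hypothetical sketch of the NS-VLA pipeline; interfaces are assumptions,
# not the paper's implementation.
from dataclasses import dataclass
from typing import List

@dataclass
class Primitive:
    name: str          # e.g. "grasp", "move_to"
    args: List[str]    # grounded object/location symbols

class SymbolicEncoder:
    """Embeds vision and language inputs and extracts structured primitives."""
    def encode(self, image, instruction: str) -> List[Primitive]:
        ...  # fuse visual and language features, emit symbolic primitives

class SymbolicSolver:
    """Sequences primitives into an executable plan (data-efficient search)."""
    def solve(self, primitives: List[Primitive]) -> List[Primitive]:
        ...  # plan over the primitive set without large-scale demonstrations

def nsvla_episode(encoder, solver, policy, env, image, instruction):
    """One episode: encode -> solve -> execute, with online RL updates."""
    primitives = encoder.encode(image, instruction)
    plan = solver.solve(primitives)
    for step in plan:
        action = policy.act(step, image)          # low-level action for this primitive
        image, reward, done = env.step(action)    # assumed gym-style environment
        policy.update(reward)                     # e.g. a policy-gradient update,
                                                  # enabling exploration beyond demos
        if done:
            break
```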