Vision-Language-Action (VLA) models have recently been proposed as a pathway toward generalist robotic policies that interpret natural language and visual inputs to generate manipulation actions. However, their effectiveness and efficiency on structured, long-horizon manipulation tasks remain unclear. In this work, we present a head-to-head empirical comparison between a fine-tuned open-weight VLA model, π0, and a neuro-symbolic architecture that combines PDDL-based symbolic planning with learned low-level control. We evaluate both approaches in simulation on structured variants of the Towers of Hanoi manipulation task, measuring task performance as well as energy consumption during training and execution. On the 3-block task, the neuro-symbolic model achieves 95% success compared to 34% for the best-performing VLA. The neuro-symbolic model also generalizes to an unseen 4-block variant (78% success), whereas both VLAs fail to complete the task. During training, VLA fine-tuning consumes nearly two orders of magnitude more energy than the neuro-symbolic approach. These results highlight important trade-offs between end-to-end foundation-model approaches and structured reasoning architectures for long-horizon robotic manipulation, emphasizing the role of explicit symbolic structure in improving reliability, data efficiency, and energy efficiency. Code and models are available at https://price-is-not-right.github.io
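To give a sense of why the Towers of Hanoi task is long-horizon and why an explicit symbolic plan carries over to more blocks, the sketch below generates a move sequence for n blocks. It is an illustration only: a plain recursive solver stands in for a PDDL planner, and the names `hanoi_plan`, `is_valid`, and the peg labels "A", "B", "C" are assumptions made for this example, not part of the released code.

```python
from typing import List, Tuple

# A move is (block size, source peg, target peg); block 1 is the smallest.
Move = Tuple[int, str, str]


def hanoi_plan(n: int, src: str = "A", dst: str = "C", aux: str = "B") -> List[Move]:
    """Return a move sequence that transfers n stacked blocks from src to dst.

    Illustrative stand-in for a symbolic planner (hypothetical helper).
    """
    if n == 0:
        return []
    # Move the n-1 smaller blocks out of the way, move the largest block,
    # then move the n-1 smaller blocks on top of it.
    return (
        hanoi_plan(n - 1, src, aux, dst)
        + [(n, src, dst)]
        + hanoi_plan(n - 1, aux, dst, src)
    )


def is_valid(n: int, plan: List[Move], src: str = "A", dst: str = "C", aux: str = "B") -> bool:
    """Simulate the plan and check every move against the Hanoi rules."""
    pegs = {src: list(range(n, 0, -1)), aux: [], dst: []}
    for block, a, b in plan:
        # Only the top block of a peg may be moved.
        if not pegs[a] or pegs[a][-1] != block:
            return False
        # A block may not be placed on a smaller block.
        if pegs[b] and pegs[b][-1] < block:
            return False
        pegs[b].append(pegs[a].pop())
    # Goal: all blocks on dst, largest at the bottom.
    return pegs[dst] == list(range(n, 0, -1))


if __name__ == "__main__":
    for n in (3, 4):
        plan = hanoi_plan(n)
        print(n, "blocks:", len(plan), "moves, valid =", is_valid(n, plan))
```

Under these assumptions the plan length is 2^n - 1, so the 3-block task needs 7 pick-and-place steps and the 4-block variant needs 15. A planner that produces the sequence symbolically handles the larger instance without retraining, leaving the learned controller to execute each individual move.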