Robot Foundation Models such as Vision-Language-Action models are rapidly reshaping how robot policies are trained and deployed, replacing hand-designed planners with end-to-end generative action models. While these systems demonstrate impressive generalization, it remains unclear whether they fundamentally resolve the long-standing challenges of robotics. We address this question by analyzing action hallucinations that violate physical constraints and their extension to plan-level failures. Focusing on latent-variable generative policies, we show that hallucinations often arise from structural mismatches between feasible robot behavior and common model architectures. We study three such barriers -- topological, precision, and horizon -- and show how they impose unavoidable tradeoffs. Our analysis provides mechanistic explanations for reported empirical failures of generative robot policies and suggests principled directions for improving reliability and trustworthiness, without abandoning their expressive power.