Vision-Language-Action (VLA) models map multimodal inputs directly to robot actions and are typically trained through large-scale imitation learning. While this paradigm has shown strong performance, prevailing VLA training procedures do not explicitly supervise hard physical constraints such as obstacle avoidance or kinematic feasibility. As a result, the geometric structure underlying physically feasible behavior must be inferred only implicitly from demonstrations. In this paper, we study whether introducing explicit feasibility supervision can provide effective structured guidance for VLA policies. We formulate a simple geometry-grounded feasibility objective and integrate it into the training stage of a diffusion-based VLA policy. To evaluate this idea systematically, we use obstacle-aware manipulation as a controlled probe of geometry-dependent physical feasibility. Empirical results show that augmenting VLA training with feasibility supervision improves both physical reliability and overall task performance, while also enhancing learning efficiency in the low-data regime. These findings indicate that explicit feasibility signals can effectively complement imitation-based VLA learning, highlighting their potential for developing more reliable VLA policies.
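The abstract does not specify the form of the feasibility objective, so the following is only a minimal sketch of what a geometry-grounded term could look like. It assumes that the policy predicts a sequence of 3D end-effector waypoints, that obstacles are approximated by spheres, and that the term is added to the imitation loss with a scalar weight. The names `feasibility_penalty`, `total_loss`, `margin`, and `weight` are hypothetical and not taken from the paper.

```python
import numpy as np


def feasibility_penalty(waypoints, centers, radii, margin=0.05):
    """Mean squared hinge penalty on obstacle-clearance violations.

    Assumptions (illustrative, not from the paper):
      waypoints: (T, 3) predicted end-effector positions.
      centers:   (K, 3) centers of sphere-approximated obstacles.
      radii:     (K,)   sphere radii.
      margin:    desired safety clearance, in the same units.
    """
    waypoints = np.asarray(waypoints, dtype=float)
    centers = np.asarray(centers, dtype=float)
    radii = np.asarray(radii, dtype=float)

    # Distance from every waypoint to every obstacle center: shape (T, K).
    dists = np.linalg.norm(waypoints[:, None, :] - centers[None, :, :], axis=-1)

    # Signed clearance: positive outside the sphere, negative inside it.
    clearance = dists - radii[None, :]

    # Only waypoints closer than the margin contribute to the penalty.
    violation = np.maximum(0.0, margin - clearance)
    return float(np.mean(violation ** 2))


def total_loss(imitation_loss, waypoints, centers, radii, weight=1.0, margin=0.05):
    """Imitation loss plus a weighted feasibility term (hypothetical combination)."""
    return imitation_loss + weight * feasibility_penalty(
        waypoints, centers, radii, margin
    )
```

In a diffusion-based policy, such a term would presumably be evaluated on the denoised action prediction and implemented in an autodiff framework so that gradients reach the policy parameters. NumPy is used here only to keep the geometry readable.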