Vision-language-action (VLA) policies often report strong performance on manipulation benchmarks with relatively few demonstrations, but it remains unclear whether this reflects robust language-to-object grounding or reliance on object--location correlations that do not transfer beyond the training distribution. We present a controlled multi-object picking study that progressively increases object placement variability up to full workspace randomization, and that evaluates held-out object--location pairings which break familiar associations without increasing spatial difficulty. Across these stress tests and across data scaling, we find that for representative VLA policies, including SmolVLA and $π_{0.5}$, execution of the manipulation primitive remains substantially more reliable than instruction-conditioned task success in harder regimes, suggesting that manipulation skill acquisition is decoupled from instruction following. We recommend augmenting manipulation benchmarks with task ladders and decomposed metrics that separately measure primitive execution and instruction-conditioned success, to better diagnose instruction-grounded generalization.