Vision-Language-Action (VLA) models typically map visual observations and linguistic instructions directly to robotic control signals. This "black-box" mapping forces a single forward pass to handle instruction interpretation, spatial grounding, and low-level control simultaneously, which often leads to poor spatial precision and limited robustness in out-of-distribution scenarios. To address these limitations, we propose VP-VLA, a dual-system framework that decouples high-level reasoning from low-level execution via a structured visual prompting interface. Specifically, a "System 2 Planner" decomposes complex instructions into sub-tasks and identifies the relevant target objects and goal locations. These spatial anchors are then overlaid directly onto the visual observations as structured visual prompts, such as crosshairs and bounding boxes. Guided by these prompts, and trained with a novel auxiliary visual grounding objective, a "System 1 Controller" reliably generates precise low-level actions. Experiments on the Robocasa-GR1-Tabletop benchmark and in SimplerEnv simulation show that VP-VLA improves success rates by 5% and 8.3%, respectively, surpassing competitive baselines including QwenOFT and GR00T-N1.6.
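To make the visual prompting interface concrete, the following is a minimal sketch of how spatial anchors might be rendered onto an observation before it is passed to the controller. It is an illustration under stated assumptions, not the VP-VLA implementation: the function names, the image representation (nested lists of RGB tuples), the marker colors and sizes, and the choice of a crosshair for the target object and a box for the goal location are all hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels; a stand-in for a camera frame


def blank_image(height: int, width: int, color: Pixel = (0, 0, 0)) -> Image:
    """Create a solid-color frame (hypothetical helper for the demo)."""
    return [[color for _ in range(width)] for _ in range(height)]


def _set_pixel(img: Image, x: int, y: int, color: Pixel) -> None:
    """Write one pixel, silently skipping coordinates outside the frame."""
    if 0 <= y < len(img) and 0 <= x < len(img[0]):
        img[y][x] = color


def draw_crosshair(img: Image, cx: int, cy: int, arm: int = 3,
                   color: Pixel = (255, 0, 0)) -> Image:
    """Return a copy of img with a crosshair centered at (cx, cy)."""
    out = [row[:] for row in img]
    for d in range(-arm, arm + 1):
        _set_pixel(out, cx + d, cy, color)  # horizontal arm
        _set_pixel(out, cx, cy + d, color)  # vertical arm
    return out


def draw_box(img: Image, x0: int, y0: int, x1: int, y1: int,
             color: Pixel = (0, 255, 0)) -> Image:
    """Return a copy of img with a one-pixel box outline (inclusive corners)."""
    out = [row[:] for row in img]
    for x in range(x0, x1 + 1):
        _set_pixel(out, x, y0, color)  # top edge
        _set_pixel(out, x, y1, color)  # bottom edge
    for y in range(y0, y1 + 1):
        _set_pixel(out, x0, y, color)  # left edge
        _set_pixel(out, x1, y, color)  # right edge
    return out


def apply_visual_prompts(img: Image, target_xy: Tuple[int, int],
                         goal_box: Tuple[int, int, int, int]) -> Image:
    """Overlay both anchors. Assumption: crosshair = target, box = goal."""
    prompted = draw_crosshair(img, *target_xy)
    return draw_box(prompted, *goal_box)


# Usage: anchors here are made-up coordinates, as a planner might emit them.
frame = blank_image(16, 16)
prompted = apply_visual_prompts(frame, target_xy=(4, 4), goal_box=(8, 8, 12, 12))
```

Because the prompts are drawn into the pixels themselves, the controller's input format stays unchanged; only the image content carries the planner's output. The sketch returns a copy so that the raw observation remains available.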