Vision-Language-Action (VLA) models promise generalist robot manipulation, but they are typically trained and deployed as short-horizon policies that assume the latest observation is sufficient for action reasoning. This assumption breaks in non-Markovian long-horizon tasks, where task-relevant evidence can be occluded or appear only earlier in the trajectory, and where clutter and distractors make fine-grained visual grounding brittle. We present CodeGraphVLP, a hierarchical framework that enables reliable long-horizon manipulation by combining a persistent semantic-graph state with an executable code-based planner and progress-guided visual-language prompting. The semantic graph maintains task-relevant entities and relations under partial observability. The synthesized planner executes over this semantic graph to perform efficient progress checks and outputs a subtask instruction together with the subtask-relevant objects. We use these outputs to construct clutter-suppressed observations that focus the VLA executor on critical evidence. On real-world non-Markovian tasks, CodeGraphVLP improves task completion over strong VLA baselines and history-enabled variants while substantially lowering planning latency compared to VLM-in-the-loop planning. Extensive ablation studies confirm the contribution of each component.
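To make the division of labor between the semantic graph and the code-based planner concrete, the following is a minimal sketch and not the implementation used in CodeGraphVLP. The class `SemanticGraph`, the function `plan`, the relation names `location` and `state`, and the toy task ("put the cube in the box, then close the box") are all hypothetical, chosen only for illustration. The sketch shows a graph that retains entities and relations after they leave view, and a planner that runs cheap symbolic progress checks over the graph and returns a subtask instruction together with the subtask-relevant objects.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticGraph:
    """Hypothetical persistent state: entities plus (subject, relation) -> object facts."""
    entities: set = field(default_factory=set)
    relations: dict = field(default_factory=dict)

    def update(self, visible_entities, observed_relations):
        # Entities are only ever added, so objects that become occluded
        # remain part of the state.
        self.entities.update(visible_entities)
        # A newly observed fact overwrites the stale value for the same
        # (subject, relation) key; unobserved facts are left untouched.
        for subj, rel, obj in observed_relations:
            self.relations[(subj, rel)] = obj

    def holds(self, subj, rel, obj):
        return self.relations.get((subj, rel)) == obj


def plan(graph):
    """Toy stand-in for a synthesized planner (assumed task, not from the paper).

    Returns (subtask instruction, subtask-relevant objects),
    or (None, []) once every progress check passes.
    """
    if not graph.holds("cube", "location", "box"):
        return "pick up the cube and place it in the box", ["cube", "box"]
    if not graph.holds("box", "state", "closed"):
        return "close the box", ["box"]
    return None, []


if __name__ == "__main__":
    graph = SemanticGraph()
    # The mug is a distractor: it is tracked but never listed as relevant.
    graph.update({"cube", "box", "mug"},
                 [("cube", "location", "table"), ("box", "state", "open")])
    print(plan(graph))

    graph.update({"box", "mug"}, [("cube", "location", "box")])
    # The cube is now hidden inside the box; the stored fact still holds.
    graph.update({"box", "mug"}, [])
    print(plan(graph))
```

In this sketch the planner never queries a VLM at run time; each call is a handful of dictionary lookups, which is the kind of property that would keep planning latency low. The returned object list (here excluding the mug) is what a downstream step could use to suppress clutter in the observation passed to the VLA executor; that masking step is not shown.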