Recent vision-language-action (VLA) systems have demonstrated strong capabilities in embodied manipulation. However, most existing VLA policies rely on limited observation windows and end-to-end action prediction, which makes them brittle in long-horizon, memory-dependent tasks that involve partial observability, occlusions, and multi-stage dependencies. Such tasks require not only precise visuomotor control but also persistent memory, adaptive task decomposition, and explicit recovery from execution failures. To address these limitations, we propose a dual-system framework for long-horizon embodied manipulation that explicitly separates high-level semantic reasoning from low-level motor execution. A high-level planner, implemented as a VLM-based agentic module, maintains structured task memory and performs goal decomposition, outcome verification, and error-driven correction. A low-level executor, instantiated as a VLA-based visuomotor controller, carries out each sub-task through diffusion-based action generation conditioned on geometry-preserving filtered observations. Together, the two systems form a closed loop between planning and execution, enabling memory-aware reasoning, adaptive replanning, and robust online recovery. Experiments on representative RMBench tasks show that the proposed framework substantially outperforms the compared baselines, achieving a 32.4% average success rate versus 9.8% for the strongest baseline. Ablation studies further confirm the importance of structured memory and closed-loop recovery for long-horizon manipulation.
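The closed loop between the high-level planner and the low-level executor can be illustrated with a minimal sketch. This is not the authors' implementation: `TaskMemory`, `run_closed_loop`, and the `decompose`, `execute`, and `verify` callables are hypothetical names introduced here. In the actual framework the planner would be a VLM-based module and the executor a diffusion-based VLA controller; here they are stand-in functions. The retry-then-abort policy is also an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaskMemory:
    """Hypothetical structured memory: what succeeded and what failed."""
    completed: List[str] = field(default_factory=list)
    failures: List[str] = field(default_factory=list)


def run_closed_loop(
    goal: str,
    decompose: Callable[[str], List[str]],   # planner: goal -> sub-tasks
    execute: Callable[[str], str],           # executor: sub-task -> observation
    verify: Callable[[str, str], bool],      # planner: did the sub-task succeed?
    max_retries: int = 2,
) -> TaskMemory:
    """Plan, execute, verify, and retry each sub-task (assumed policy)."""
    memory = TaskMemory()
    for subtask in decompose(goal):
        for attempt in range(1, max_retries + 2):
            observation = execute(subtask)
            if verify(subtask, observation):
                memory.completed.append(subtask)
                break
            # Record the failure so later planning steps could condition on it.
            memory.failures.append(f"{subtask}: attempt {attempt} failed")
        else:
            # All retries exhausted: stop and return what was achieved so far.
            return memory
    return memory


if __name__ == "__main__":
    calls = {}

    def flaky_execute(subtask: str) -> str:
        # Toy executor: the first "grasp" attempt fails, everything else works.
        calls[subtask] = calls.get(subtask, 0) + 1
        if subtask == "grasp" and calls[subtask] == 1:
            return "fail"
        return "ok"

    result = run_closed_loop(
        goal="move block to bin",
        decompose=lambda g: ["reach", "grasp", "place"],
        execute=flaky_execute,
        verify=lambda s, obs: obs == "ok",
    )
    print(result)
```

The design point the sketch tries to capture is that verification and failure records live in the planner's memory rather than inside the executor, so recovery decisions can draw on the full task history instead of a short observation window.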