Long-horizon manipulation remains challenging for vision-language-action (VLA) policies: real tasks are multi-step, progress-dependent, and prone to compounding execution errors. We present LoHo-Manip, a modular framework that scales short-horizon VLA execution to long-horizon instruction following via a dedicated task-management VLM. The manager is decoupled from the executor and is invoked in a receding-horizon manner: given the current observation, it predicts a progress-aware remaining plan that combines (i) a subtask sequence with an explicit done + remaining split, which serves as lightweight language memory, and (ii) a visual trace, a compact 2D keypoint trajectory prompt specifying where to go and what to approach next. The executor VLA is adapted to condition on the rendered trace, turning long-horizon decision-making into repeated local control that follows the trace. Crucially, predicting the remaining plan at each step yields an implicit closed loop: failed steps persist in subsequent outputs, and traces update accordingly, enabling automatic continuation and replanning without hand-crafted recovery logic or brittle visual-history buffers. Extensive experiments spanning embodied planning, long-horizon reasoning, trajectory prediction, and end-to-end manipulation in simulation and on a real Franka robot demonstrate strong gains in long-horizon success, robustness, and out-of-distribution generalization. Project page: https://www.liuisabella.com/LoHoManip
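The receding-horizon loop described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: the manager here is a hand-written function standing in for the task-management VLM, the executor is a scripted world standing in for the trace-conditioned VLA, and every name (`Plan`, `ToyWorld`, `toy_manager`, `run_receding_horizon`, `SUBTASKS`, `TARGETS`) is hypothetical. The sketch only shows how re-predicting the done/remaining split from the current observation makes a failed step reappear in the next plan without any explicit recovery logic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]

# Hypothetical task: subtask names and the 2D keypoint each one must reach.
SUBTASKS: List[str] = ["open drawer", "pick cup", "place cup in drawer"]
TARGETS: Dict[str, Point] = {
    "open drawer": (0.2, 0.5),
    "pick cup": (0.6, 0.3),
    "place cup in drawer": (0.2, 0.6),
}


@dataclass
class Plan:
    done: List[str]        # subtasks the manager judges complete
    remaining: List[str]   # subtasks still to do, in order
    trace: List[Point]     # 2D keypoints for the next subtask


class ToyWorld:
    """Scripted stand-in for the robot plus the executor policy."""

    def __init__(self, failing_attempts=()):
        self.completed = set()
        self.gripper: Point = (0.0, 0.0)
        self.attempts = 0
        self.failing_attempts = set(failing_attempts)

    def observe(self) -> dict:
        return {"completed": frozenset(self.completed), "gripper": self.gripper}

    def follow_trace(self, obs: dict, trace: List[Point]) -> None:
        """Follow the trace to its last keypoint unless this attempt is scripted to fail."""
        self.attempts += 1
        if self.attempts in self.failing_attempts:
            return  # simulated execution failure: the world does not change
        self.gripper = trace[-1]
        for name, target in TARGETS.items():
            if target == self.gripper:
                self.completed.add(name)


def toy_manager(obs: dict) -> Plan:
    """Predict the done/remaining split and the next trace from the current observation only."""
    done = [s for s in SUBTASKS if s in obs["completed"]]
    remaining = [s for s in SUBTASKS if s not in obs["completed"]]
    trace = [obs["gripper"], TARGETS[remaining[0]]] if remaining else []
    return Plan(done, remaining, trace)


def run_receding_horizon(
    manager: Callable[[dict], Plan],
    world: ToyWorld,
    max_calls: int = 20,
) -> Tuple[bool, List[Plan]]:
    """Alternate manager calls and short-horizon execution until nothing remains."""
    history: List[Plan] = []
    for _ in range(max_calls):
        obs = world.observe()
        plan = manager(obs)       # re-plan from scratch at every call
        history.append(plan)
        if not plan.remaining:
            return True, history
        world.follow_trace(obs, plan.trace)
    return False, history


# The second execution attempt is scripted to fail.
world = ToyWorld(failing_attempts={2})
success, history = run_receding_horizon(toy_manager, world)
```

In this run the failed "pick cup" attempt leaves the observation unchanged, so the next manager call lists "pick cup" first in `remaining` again and emits a fresh trace toward it. The loop itself contains no failure-handling branch; continuation falls out of re-predicting the remaining plan at every call.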