Open-vocabulary long-horizon manipulation requires robots to reason over flexible instructions and complex multi-object scenes while adaptively planning, executing, monitoring, and recovering from failures. We address these demands with a closed agent loop in which a VLM orchestrates heterogeneous robot capabilities as interruptible tools. Unlike for virtual AI agents, the timing of decisions, actions, and tool calls matters in a physical world that does not pause for reasoning. We refer to this setting as Physical Orchestration and propose VoLoAgent, a VLM that plans, monitors, and recovers by treating a VLA/WAM as an interruptible tool that it steers mid-rollout, alongside vision models and action primitives. To evaluate these long-horizon capabilities, we introduce RoboVoLo, a high-fidelity benchmark for open-vocabulary long-horizon manipulation that spans common sense, memory/state tracking, complex references, and world knowledge, and reports both task-level success and failure-mode diagnostics. Experiments show that VoLoAgent substantially outperforms systems built on a single VLA or VLM as well as tool-based systems, and we further validate it in real-robot experiments. Project page: https://chicychen.github.io/VoLo/