Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning. A common solution is building hierarchical agents, where VLMs serve as high-level reasoners that break down tasks into executable sub-tasks, typically specified in language. However, language alone cannot convey detailed spatial information. We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions from concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2. Our method unlocks the potential of VLMs, enabling them to tackle complex tasks that demand spatial reasoning. Experiments in Minecraft show that our approach enables agents to achieve previously unattainable tasks, with a $\mathbf{76}\%$ absolute improvement in open-world interaction performance. Code and demos are available on the project page: https://craftjarvis.github.io/ROCKET-1.
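To make the input layout concrete, here is a minimal sketch of how a visual-temporal context prompt could be assembled: each past observation frame is paired with its tracked-object segmentation mask and concatenated along the channel axis before being fed to the policy. The function name `build_policy_input` and the exact channel layout are illustrative assumptions, not the paper's verified implementation.

```python
# Hedged sketch: assembling a visual-temporal context prompt.
# The name build_policy_input and the RGB+mask channel layout are
# assumptions for illustration; ROCKET-1's actual architecture may differ.
import numpy as np

def build_policy_input(frames, masks):
    """Concatenate each RGB frame (H, W, 3) with its binary object mask
    (H, W) along the channel axis, yielding a (T, H, W, 4) policy input."""
    stacked = []
    for frame, mask in zip(frames, masks):
        mask_channel = mask.astype(frame.dtype)[..., None]  # (H, W, 1)
        stacked.append(np.concatenate([frame, mask_channel], axis=-1))
    return np.stack(stacked)  # (T, H, W, 4)

# Example: two timesteps of 64x64 RGB observations with object masks
# (in practice the masks would come from a tracker such as SAM-2).
frames = [np.zeros((64, 64, 3), dtype=np.float32) for _ in range(2)]
masks = [np.ones((64, 64), dtype=bool) for _ in range(2)]
obs = build_policy_input(frames, masks)
print(obs.shape)  # (2, 64, 64, 4)
```

Keeping the mask as an extra input channel lets a convolutional backbone attend to the highlighted object without any language-side description of its position.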