Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks. Website: steerable-policies.github.io