Legged robots are physically capable of navigating a wide variety of environments and overcoming a broad range of obstructions. For example, in a search and rescue mission, a legged robot could climb over debris, crawl through gaps, and navigate out of dead ends. However, the robot's controller needs to respond intelligently to such varied obstacles, and this requires handling unexpected and unusual scenarios successfully. This presents an open challenge to current learning methods, which often struggle with generalization to the long tail of unexpected situations without heavy human supervision. To address this issue, we investigate how to leverage the broad knowledge about the structure of the world and commonsense reasoning capabilities of vision-language models (VLMs) to aid legged robots in handling difficult, ambiguous situations. We propose a system, VLM-Predictive Control (VLM-PC), combining two key components that we find to be crucial for eliciting on-the-fly, adaptive behavior selection with VLMs: (1) in-context adaptation over previous robot interactions and (2) planning multiple skills into the future and replanning. We evaluate VLM-PC on several challenging real-world obstacle courses, involving dead ends and climbing and crawling, with a Go1 quadruped robot. Our experiments show that by reasoning over the history of interactions and future plans, VLMs enable the robot to autonomously perceive, navigate, and act in a wide range of complex scenarios that would otherwise require environment-specific engineering or human guidance.
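The two components named above (in-context adaptation over the interaction history, and multi-skill planning with replanning) can be illustrated with a minimal control-loop sketch. This is not the paper's implementation: the skill names, the `vlm_plan` stand-in (a real system would query a vision-language model here), and the `(skill, outcome)` history format are all illustrative assumptions.

```python
from collections import deque

# Hypothetical skill vocabulary; the actual skill set is not specified here.
SKILLS = {"walk_forward", "turn_left", "turn_right", "climb", "crawl", "back_up"}

def vlm_plan(image, history, horizon=3):
    """Stand-in for a VLM query: given the current camera image and the
    in-context history of (skill, outcome) pairs, return a plan of several
    skills into the future. A toy heuristic replaces the model call: after
    two consecutive failures, back out of the apparent dead end."""
    recent = list(history)[-2:]
    if len(recent) == 2 and all(out == "stuck" for _, out in recent):
        return ["back_up", "turn_left", "walk_forward"][:horizon]
    return ["walk_forward"] * horizon

def vlm_pc_step(image, history, execute_skill, horizon=3):
    """One iteration: plan multiple skills ahead, execute only the first,
    append the outcome to the history, and replan on the next call."""
    plan = vlm_plan(image, history, horizon)
    skill = plan[0]
    outcome = execute_skill(skill)
    history.append((skill, outcome))
    return skill, plan
```

Executing only the first skill of each plan and then replanning mirrors the receding-horizon structure suggested by the name "VLM-Predictive Control", while the growing history provides the in-context signal for adapting after repeated failures.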