We present BehAV, a novel approach for autonomous robot navigation in outdoor scenes, guided by human instructions and leveraging Vision Language Models (VLMs). Our method interprets human commands using a Large Language Model (LLM) and categorizes the instructions into navigation and behavioral guidelines. Navigation guidelines consist of directional commands (e.g., "move forward until") and associated landmarks (e.g., "the building with blue windows"), while behavioral guidelines encompass regulatory actions (e.g., "stay on") and their corresponding objects (e.g., "pavements"). We leverage the zero-shot scene understanding capabilities of VLMs to estimate landmark locations from RGB images for robot navigation. Furthermore, we introduce a novel scene representation that uses VLMs to ground behavioral rules into a behavioral cost map. This cost map encodes the presence of behavioral objects within the scene and assigns costs based on their regulatory actions. The behavioral cost map is fused with a LiDAR-based occupancy map for navigation. To navigate outdoor scenes while adhering to the instructed behaviors, we present an unconstrained Model Predictive Control (MPC)-based planner that jointly prioritizes reaching landmarks and following behavioral guidelines. We evaluate BehAV on a quadruped robot across diverse real-world scenarios, demonstrating a 22.49% improvement in alignment with human-teleoperated actions, as measured by Fréchet distance, and a 40% higher navigation success rate compared to state-of-the-art methods.
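To make the planning step concrete, the sketch below shows a minimal sampling-based, MPC-style planner in Python that scores candidate control sequences by a weighted sum of (a) distance from the trajectory endpoint to a landmark goal and (b) accumulated behavioral cost read from a grid cost map. This is an illustrative sketch only: the unicycle motion model, the random-sampling strategy, and all function names, weights, and parameters are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def rollout(state, controls, dt=0.1):
    """Integrate a unicycle model (x, y, theta) under (v, w) controls."""
    x, y, th = state
    traj = []
    for v, w in controls:
        th += w * dt
        x += v * np.cos(th) * dt
        y += v * np.sin(th) * dt
        traj.append((x, y))
    return np.array(traj)

def trajectory_cost(traj, goal, cost_map, resolution=1.0,
                    w_goal=1.0, w_beh=5.0):
    """Weighted sum of goal-reaching and behavioral-map costs (illustrative)."""
    # Goal term: distance from the final trajectory point to the landmark.
    goal_cost = np.linalg.norm(traj[-1] - goal)
    # Behavioral term: sum of cost-map values visited along the trajectory.
    idx = np.clip((traj / resolution).astype(int), 0,
                  np.array(cost_map.shape) - 1)
    beh_cost = cost_map[idx[:, 0], idx[:, 1]].sum()
    return w_goal * goal_cost + w_beh * beh_cost

def plan(state, goal, cost_map, horizon=10, n_samples=64, seed=0):
    """Sample control sequences and keep the lowest-cost one."""
    rng = np.random.default_rng(seed)
    best_cost, best_controls = np.inf, None
    for _ in range(n_samples):
        controls = np.column_stack([
            rng.uniform(0.2, 1.0, horizon),   # linear velocity (m/s)
            rng.uniform(-0.5, 0.5, horizon),  # angular velocity (rad/s)
        ])
        c = trajectory_cost(rollout(state, controls), goal, cost_map)
        if c < best_cost:
            best_cost, best_controls = c, controls
    return best_controls, best_cost
```

In a receding-horizon loop, only the first control of the winning sequence would be executed before replanning; a high `w_beh` makes the planner trade goal progress for compliance with the behavioral rules, mirroring the trade-off the abstract describes.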