This paper proposes an interactive navigation framework that uses large language and vision-language models to enable robots to navigate environments containing traversable obstacles. We employ a large language model (GPT-3.5) and an open-set vision-language model (Grounding DINO) to create an action-aware costmap for effective path planning without fine-tuning. With these large models, we achieve an end-to-end system that maps textual instructions, such as "Can you pass through the curtains to deliver medicines to me?", to bounding boxes (e.g., around the curtains) with action-aware attributes. These bounding boxes are used to segment the LiDAR point cloud into traversable and untraversable parts, from which an action-aware costmap is constructed to generate a feasible path. The pre-trained large models generalize well and require no additional annotated training data, enabling fast deployment in interactive navigation tasks. We verify the framework on multiple traversable objects, such as curtains and grass, by instructing the robot to traverse them; we also test traversing curtains in a medical scenario. All experimental results demonstrate the proposed framework's effectiveness and adaptability to diverse environments.
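The pipeline described above can be sketched in a few lines of code. The following is a minimal illustration under stated assumptions, not the authors' implementation: the GPT-3.5 and Grounding DINO calls are stubbed with fixed return values, and all function names (`parse_instruction`, `detect_object`, `build_costmap`), the grid parameters, and the cost values are hypothetical choices for this sketch.

```python
# Sketch of the abstract's pipeline (hypothetical names throughout):
# instruction -> (object, action attribute) -> bounding box -> point-cloud
# labeling -> action-aware costmap. The LLM (GPT-3.5) and VLM (Grounding
# DINO) calls are stubbed; a real system would query those models here.
import numpy as np

def parse_instruction(instruction: str) -> tuple[str, str]:
    """Stub for the GPT-3.5 step: extract the target object and an
    action-aware attribute ('traversable' or 'untraversable')."""
    # e.g. "Can you pass through the curtains ..." -> ("curtains", "traversable")
    return "curtains", "traversable"

def detect_object(object_name: str) -> tuple[float, float, float, float]:
    """Stub for the Grounding DINO step: open-set detection returning a
    2-D bounding box (x_min, y_min, x_max, y_max) in map coordinates."""
    return 2.0, -0.5, 3.0, 0.5  # made-up box covering the curtain region

def build_costmap(points: np.ndarray, box: tuple, traversable: bool,
                  size: int = 40, resolution: float = 0.25) -> np.ndarray:
    """Rasterize 2-D LiDAR points into a grid: obstacle cells get lethal
    cost unless they fall inside a box labeled traversable."""
    cost = np.zeros((size, size))
    x_min, y_min, x_max, y_max = box
    half_extent = size * resolution / 2  # map origin at grid center
    for x, y in points:
        i = int((x + half_extent) / resolution)
        j = int((y + half_extent) / resolution)
        if not (0 <= i < size and 0 <= j < size):
            continue
        inside = x_min <= x <= x_max and y_min <= y <= y_max
        if inside and traversable:
            cost[i, j] = max(cost[i, j], 10.0)   # low, passable cost
        else:
            cost[i, j] = 255.0                   # lethal obstacle
    return cost

# Usage: curtain points become passable; other obstacles stay lethal,
# so a standard planner can route the robot through the curtain.
obj, attr = parse_instruction(
    "Can you pass through the curtains to deliver medicines to me?")
box = detect_object(obj)
cloud = np.array([[2.5, 0.0], [2.5, 0.2], [1.0, 2.0]])  # toy LiDAR scan (x, y)
costmap = build_costmap(cloud, box, traversable=(attr == "traversable"))
```

The key design point this sketch tries to capture is that the large models only relabel obstacle costs; the downstream planner itself is unchanged, which is what allows the system to run without fine-tuning or extra annotated data.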