This paper proposes an interactive navigation framework by using large language and vision-language models, allowing robots to navigate in environments with traversable obstacles. We utilize the large language model (GPT-3.5) and the open-set Vision-language Model (Grounding DINO) to create an action-aware costmap to perform effective path planning without fine-tuning. With the large models, we can achieve an end-to-end system from textual instructions like "Can you pass through the curtains to deliver medicines to me?", to bounding boxes (e.g., curtains) with action-aware attributes. They can be used to segment LiDAR point clouds into two parts: traversable and untraversable parts, and then an action-aware costmap is constructed for generating a feasible path. The pre-trained large models have great generalization ability and do not require additional annotated data for training, allowing fast deployment in the interactive navigation tasks. We choose to use multiple traversable objects such as curtains and grasses for verification by instructing the robot to traverse them. Besides, traversing curtains in a medical scenario was tested. All experimental results demonstrated the proposed framework's effectiveness and adaptability to diverse environments.
翻译:本文提出了一种交互式导航框架,通过利用大规模语言模型和视觉-语言模型,使机器人能够在包含可穿越障碍物的环境中进行导航。我们采用大规模语言模型(GPT-3.5)和开放集视觉-语言模型(Grounding DINO)构建一种动作感知代价地图,无需微调即可实现有效路径规划。借助这些大模型,我们能够构建从文本指令(如“你能穿过窗帘把药递给我吗?”)到带有动作感知属性的边界框(例如窗帘)的端到端系统。这些边界框可用于将激光雷达点云分割为可穿越和不可穿越两部分,进而生成用于规划可行路径的动作感知代价地图。预训练的大模型具备强大的泛化能力,无需额外标注数据进行训练,从而可快速部署于交互式导航任务中。我们选取窗帘、草丛等多种可穿越物体进行验证,通过指令控制机器人穿越这些物体。此外,还测试了医疗场景中穿越窗帘的任务。所有实验结果均表明,所提框架在不同环境中具有有效性和适应性。