This paper proposes an interactive navigation framework by using large language and vision-language models, allowing robots to navigate in environments with traversable obstacles. We utilize the large language model (GPT-3.5) and the open-set Vision-language Model (Grounding DINO) to create an action-aware costmap to perform effective path planning without fine-tuning. With the large models, we can achieve an end-to-end system from textual instructions like "Can you pass through the curtains to deliver medicines to me?", to bounding boxes (e.g., curtains) with action-aware attributes. They can be used to segment LiDAR point clouds into two parts: traversable and untraversable parts, and then an action-aware costmap is constructed for generating a feasible path. The pre-trained large models have great generalization ability and do not require additional annotated data for training, allowing fast deployment in the interactive navigation tasks. We choose to use multiple traversable objects such as curtains and grasses for verification by instructing the robot to traverse them. Besides, traversing curtains in a medical scenario was tested. All experimental results demonstrated the proposed framework's effectiveness and adaptability to diverse environments.
翻译:本文提出了一种基于大型语言模型和视觉语言模型的交互式导航框架,使机器人能够在包含可穿越障碍物的环境中实现导航。我们利用大型语言模型(GPT-3.5)和开放集视觉语言模型(Grounding DINO)构建具有动作感知能力的代价地图,无需微调即可执行有效的路径规划。借助这些大型模型,我们实现了从文本指令(例如:"你能穿过窗帘把药递给我吗?")到具有动作感知属性的边界框(如窗帘)的端到端系统。这些边界框可用于将激光雷达点云分割为可穿越和不可穿越两部分,进而构建动作感知代价地图以生成可行路径。预训练的大型模型具有强大的泛化能力,无需额外标注数据即可快速部署于交互式导航任务中。我们选取了窗帘、草地等多种可穿越障碍物进行验证,通过指令引导机器人穿越这些障碍物,并针对医疗场景下的窗帘穿越进行了测试。所有实验结果均表明,所提框架在不同环境中具有良好的有效性与适应性。