We propose VLM-Social-Nav, a novel Vision-Language Model (VLM)-based navigation approach that computes a robot's motion in human-centered environments. Our goal is to make real-time decisions on robot actions that are socially compliant with human expectations. We use a perception model to detect important social entities and prompt a VLM to generate guidance for socially compliant robot behavior. VLM-Social-Nav then employs a VLM-based scoring module that computes a cost term ensuring that the actions generated by the underlying planner are both socially appropriate and effective. Our overall approach reduces reliance on large training datasets and improves adaptability in decision-making; in practice, it yields more socially compliant navigation in human-shared environments. We demonstrate and evaluate our system in four different real-world social navigation scenarios with a Turtlebot robot, observing at least a 27.38% improvement in the average success rate and a 19.05% reduction in the average collision rate across the four scenarios. Our user study scores show that VLM-Social-Nav generates the most socially compliant navigation behavior.
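To illustrate the kind of integration the abstract describes, below is a minimal sketch of how a VLM-derived social cost term might be combined with a planner's goal-reaching cost to select among candidate actions. All names and structures here (Action, social_cost, the candidate sampling, the weights) are illustrative assumptions, not the paper's actual implementation.

import math
from dataclasses import dataclass

@dataclass
class Action:
    linear: float   # forward velocity (m/s)
    angular: float  # turn rate (rad/s)

def goal_cost(action: Action, goal_heading: float) -> float:
    """Penalize actions that steer away from the goal heading
    (simple 1 s lookahead; a stand-in for the planner's own objective)."""
    predicted_heading = action.angular * 1.0
    return abs(goal_heading - predicted_heading)

def social_cost(action: Action, vlm_preferred: Action, weight: float = 1.0) -> float:
    """Penalize deviation from the action favored by the VLM's guidance.
    `vlm_preferred` is a hypothetical stand-in for the output of the
    VLM-based scoring module described in the abstract."""
    return weight * math.hypot(action.linear - vlm_preferred.linear,
                               action.angular - vlm_preferred.angular)

def select_action(candidates, goal_heading, vlm_preferred):
    """Pick the candidate minimizing the combined planner + social cost."""
    return min(candidates,
               key=lambda a: goal_cost(a, goal_heading)
                           + social_cost(a, vlm_preferred))

if __name__ == "__main__":
    # Sample a small set of constant-speed candidates with varying turn rates.
    candidates = [Action(0.5, w / 10) for w in range(-5, 6)]
    best = select_action(candidates, goal_heading=0.3,
                         vlm_preferred=Action(0.5, 0.2))
    print(f"chosen action: v={best.linear:.2f} m/s, w={best.angular:.2f} rad/s")

In this sketch the VLM's guidance enters purely as an additive cost term, so the underlying planner remains responsible for feasibility while the VLM biases the choice toward socially compliant behavior.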