Socially compliant navigation requires robots to move safely and appropriately in human-centered environments by respecting social norms. However, social norms are often ambiguous, and in a single scenario, multiple actions may be equally acceptable. Most existing methods simplify this problem by assuming a single correct action, which limits their ability to handle real-world social uncertainty. In this work, we propose MAction-SocialNav, an efficient vision-language model for socially compliant navigation that explicitly addresses action ambiguity by generating multiple plausible actions for a single scenario. To enhance the model's reasoning capability, we introduce a novel meta-cognitive prompt (MCP) method. Furthermore, to evaluate the proposed method, we curate a multi-action socially compliant navigation dataset that accounts for diverse conditions, including crowd density, indoor and outdoor environments, and dual human annotations. The dataset contains 789 samples, each with a three-turn conversation, randomly split into 710 training samples and 79 test samples. We also design five evaluation metrics to assess high-level decision precision, safety, and diversity. Extensive experiments demonstrate that MAction-SocialNav achieves strong social reasoning performance while maintaining high efficiency, highlighting its potential for real-world human-robot navigation. Compared with zero-shot GPT-4o and Claude, our model achieves substantially higher decision quality (APG: 0.595 vs. 0.000/0.025) and safety alignment (ER: 0.264 vs. 0.642/0.668), while maintaining real-time efficiency (1.524 FPS, over 3x faster).