Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous spaces, with little attention to dynamic, crowded environments. We present HA-VLN 2.0, a unified benchmark that introduces explicit social-awareness constraints. Our contributions are: (i) a standardized task and metrics that capture both goal accuracy and personal-space adherence; (ii) the HAPS 2.0 dataset and simulators, which model multi-human interactions, outdoor contexts, and finer language-motion alignment; (iii) benchmarks on 16,844 socially grounded instructions, which reveal sharp performance drops for leading agents under human dynamics and partial observability; and (iv) real-world robot experiments that validate sim-to-real transfer, together with an open leaderboard that enables transparent comparison. Results show that explicit social modeling improves navigation robustness and reduces collisions, underscoring the necessity of human-centric approaches. By releasing datasets, simulators, baselines, and protocols, HA-VLN 2.0 provides a strong foundation for safe, human-aware navigation research.