Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), which extends traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, which extends R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, which leverage cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics that account for human activities, together with a systematic analysis of HA-VLN's unique challenges, underscores the need for further research to enhance the real-world robustness and adaptability of HA-VLN agents. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.