Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), which extends traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, which extends R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, which leverage cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics that account for human activities, together with a systematic analysis of HA-VLN's unique challenges, underscores the need for further research to enhance the real-world robustness and adaptability of HA-VLN agents. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.