Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), which extends traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, which extends R2R with human activity descriptions. To tackle HA-VLN's challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, which leverage cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics that account for human activities, and a systematic analysis of HA-VLN's unique challenges underscore the need for further research to enhance agents' real-world robustness and adaptability. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.