Navigating complex urban environments using natural language instructions poses significant challenges for embodied agents, including noisy language instructions, ambiguous spatial references, diverse landmarks, and dynamic street scenes. Current visual navigation methods are typically limited to simulated or off-street environments and often rely on precise goal formats, such as specific coordinates or images. This limits their effectiveness for autonomous agents, such as last-mile delivery robots, navigating unfamiliar cities. To address these limitations, we introduce UrbanNav, a scalable framework that trains embodied agents to follow free-form language instructions in diverse urban settings. Leveraging web-scale city walking videos, we develop a scalable annotation pipeline that aligns human navigation trajectories with language instructions grounded in real-world landmarks. UrbanNav encompasses over 1,500 hours of navigation data and 3 million instruction-trajectory-landmark triplets, capturing a wide range of urban scenarios. Our model learns robust navigation policies for complex urban scenarios, demonstrating superior spatial reasoning, robustness to noisy instructions, and generalization to unseen urban settings. Experimental results show that UrbanNav significantly outperforms existing methods, highlighting the potential of large-scale web video data to enable language-guided, real-world urban navigation for embodied agents.