Vision-and-Language Navigation (VLN), a widely studied research direction in embodied intelligence, aims to enable embodied agents to navigate complex visual environments by following natural language instructions. Most existing VLN methods focus on indoor ground-robot scenarios; however, when applied to UAV VLN in outdoor urban scenes, they face two significant challenges. First, urban scenes contain numerous objects, making it difficult to match fine-grained landmarks in images with their complex textual descriptions. Second, the overall environmental information spans multiple modalities, and this diversity of representations significantly increases the complexity of encoding. To address these challenges, we propose NavAgent, the first urban UAV embodied navigation model driven by a large vision-language model. NavAgent performs navigation by synthesizing multi-scale environmental information, including topological maps (global), panoramas (medium), and fine-grained landmarks (local). Specifically, we utilize GLIP to build a landmark visual recognizer capable of identifying and verbalizing fine-grained landmarks. We then develop a dynamically growing scene topology map that integrates environmental information, and employ Graph Convolutional Networks to encode the global environment. In addition, to train the landmark visual recognizer, we build NavAgent-Landmark2K, the first fine-grained landmark dataset for real urban street scenes. In experiments on the Touchdown and Map2seq datasets, NavAgent outperforms strong baseline models. The code and dataset will be released to the community to facilitate the exploration and development of outdoor VLN.
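To make the topology-map idea concrete, the following is a minimal sketch of a dynamically growing scene graph encoded with a single GCN layer. All names (`SceneGraph`, `gcn_layer`, `add_node`) are hypothetical illustrations, not the paper's implementation; the propagation rule is the standard GCN update H' = ReLU(D̂^(-1/2) (A+I) D̂^(-1/2) H W).

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One GCN layer with symmetric normalization and ReLU.

    adj: (N, N) undirected adjacency matrix (no self-loops).
    feats: (N, F) node features.  weight: (F, F') learned projection.
    """
    a_hat = adj + np.eye(adj.shape[0])                   # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(1)))    # D̂^(-1/2)
    norm = d_inv_sqrt @ a_hat @ d_inv_sqrt               # normalized adjacency
    return np.maximum(norm @ feats @ weight, 0.0)        # propagate + ReLU

class SceneGraph:
    """Dynamically growing topology map: nodes are visited viewpoints,
    edges connect viewpoints the agent moved between."""

    def __init__(self, feat_dim):
        self.feats = np.zeros((0, feat_dim))
        self.adj = np.zeros((0, 0))

    def add_node(self, feat, neighbors=()):
        """Append a viewpoint feature; link it to earlier node indices."""
        n = self.adj.shape[0]
        self.feats = np.vstack([self.feats, feat])
        grown = np.zeros((n + 1, n + 1))
        grown[:n, :n] = self.adj
        for j in neighbors:
            grown[n, j] = grown[j, n] = 1.0
        self.adj = grown
        return n  # index of the new node

    def encode(self, weight):
        """Global environment encoding: one GCN pass over the whole graph."""
        return gcn_layer(self.adj, self.feats, weight)
```

For example, after adding three viewpoints in sequence (each linked to its predecessor), `encode` returns one context-aware embedding per viewpoint, which a downstream policy could pool into a global environment representation.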