Aerial Vision-and-Language Navigation (Aerial VLN) enables unmanned aerial vehicles (UAVs) to follow natural language instructions and navigate complex urban environments. Recent methods have made progress through large-scale memory graphs and lookahead path planning, but they remain limited by shallow instruction understanding and high computational cost. In particular, existing methods rely primarily on landmark descriptions and overlook directional cues, a key source of spatial context in human navigation. In this work, we propose LookasideVLN, a new paradigm that exploits directional cues in natural language to achieve both more accurate spatial reasoning and greater computational efficiency. LookasideVLN comprises three core components: (1) an Egocentric Lookaside Graph (ELG) that dynamically encodes instruction-relevant landmarks and their directional relationships, (2) a Spatial Landmark Knowledge Base (SLKB) that provides lightweight memory retrieval from prior navigation experiences, and (3) a Lookaside MLLM Navigation Agent that aligns multimodal information from user instructions, visual observations, and landmark-direction information from the ELG for path planning. Extensive experiments show that LookasideVLN significantly outperforms the state-of-the-art CityNavAgent, even with a single-level lookahead, demonstrating that leveraging directional cues is a powerful yet efficient strategy for Aerial VLN.
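The abstract does not specify how the ELG represents directional relationships. As a rough illustration only, the sketch below shows one way an egocentric landmark-direction encoding could be computed. Everything in it is an assumption made for the example and not the paper's implementation: the 2D coordinates, the 8-way direction labels, and the names `egocentric_direction` and `EgocentricGraph` are all hypothetical.

```python
import math
from dataclasses import dataclass, field

# Hypothetical 8-way egocentric labels, ordered counter-clockwise from "front".
DIRECTIONS = ["front", "front-left", "left", "back-left",
              "back", "back-right", "right", "front-right"]


def egocentric_direction(agent_xy, heading_deg, landmark_xy):
    """Label a landmark's direction relative to the agent's heading.

    heading_deg is measured counter-clockwise from the +x axis
    (an assumption of this sketch).
    """
    dx = landmark_xy[0] - agent_xy[0]
    dy = landmark_xy[1] - agent_xy[1]
    bearing = math.degrees(math.atan2(dy, dx))   # world-frame bearing
    relative = (bearing - heading_deg) % 360.0   # agent-frame angle in [0, 360)
    # Shift by half a sector so each 45-degree bin is centred on its label.
    index = int(((relative + 22.5) % 360.0) // 45.0)
    return DIRECTIONS[index]


@dataclass
class EgocentricGraph:
    """Toy stand-in for an egocentric graph: landmark name -> direction label."""
    edges: dict = field(default_factory=dict)

    def update(self, agent_xy, heading_deg, landmarks):
        # Rebuild the agent-to-landmark edges from the current pose.
        self.edges = {
            name: egocentric_direction(agent_xy, heading_deg, xy)
            for name, xy in landmarks.items()
        }
        return self.edges


graph = EgocentricGraph()
relations = graph.update(
    agent_xy=(0.0, 0.0),
    heading_deg=90.0,  # facing +y
    landmarks={"tower": (0.0, 10.0), "bridge": (-10.0, 0.0), "park": (10.0, 10.0)},
)
print(relations)
```

In this toy setup the graph is rebuilt from the current pose at every step, which is one simple reading of "dynamically encodes"; an actual system would also need landmark detection and instruction grounding, which are outside the scope of this sketch.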