Navigating unseen environments based on natural language instructions remains difficult for egocentric agents in Vision-and-Language Navigation (VLN). While recent advances have yielded promising results, they rely primarily on RGB images for environmental representation, often overlooking underlying semantic knowledge and spatial cues. Intuitively, humans ground textual semantics in the spatial layout of their surroundings during indoor navigation. Inspired by this, we propose a versatile Semantic Understanding and Spatial Awareness (SUSA) architecture to facilitate navigation. SUSA includes a Textual Semantic Understanding (TSU) module, which narrows the modality gap between instructions and environments by generating and associating descriptions of landmarks in the agent's immediate surroundings. Additionally, a Depth-based Spatial Perception (DSP) module incrementally constructs a depth exploration map, enabling a more fine-grained understanding of environmental layouts. Experimental results demonstrate that SUSA's hybrid semantic-spatial representations effectively improve navigation performance, setting new state-of-the-art results across three VLN benchmarks (REVERIE, R2R, and SOON). The source code will be publicly available.