Vision-and-Language Navigation (VLN) is a realistic but challenging task in which an agent must locate a target region using verbal and visual cues. Despite significant recent progress, two broad limitations remain: (1) explicit mining of the important guiding semantics contained in both vision and language is still under-explored; (2) previous structured-map methods represent each visited node by its average historical appearance, ignoring both the distinct contributions of individual images and the retention of useful information during reasoning. This work proposes a dual semantic-aware recurrent global-adaptive network (DSRG) to address these problems. First, DSRG introduces an instruction-guidance linguistic module (IGL) and an appearance-semantics visual module (ASV) to strengthen language and vision semantic learning, respectively. For the memory mechanism, a global adaptive aggregation module (GAA) is devised for explicit panoramic observation fusion, and a recurrent memory fusion module (RMF) is introduced to supply implicit temporal hidden states. Extensive experiments on the R2R and REVERIE datasets demonstrate that our method achieves better performance than existing methods.
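The contrast drawn in limitation (2), between averaging the views of a node and weighting them by their contribution, can be illustrated with a small sketch. This is not the GAA module itself, whose details the abstract does not give. The function names, the `query` vector, and the dot-product softmax scoring are all assumptions chosen only to show how adaptive weighting differs from a plain mean.

```python
import math

def mean_pool(views):
    """Uniform baseline: average the view features elementwise."""
    n = len(views)
    return [sum(col) / n for col in zip(*views)]

def adaptive_pool(views, query):
    """Hypothetical adaptive aggregation: softmax over dot(view, query),
    then a weighted sum of the view features. Illustrative only."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in views]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(views[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, views)) for d in range(dim)]
    return pooled, weights

# Three toy 2-d "view features" for one node (made-up values).
views = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
query = [2.0, 0.0]  # assumed context vector that favours the first view

uniform = mean_pool(views)
pooled, weights = adaptive_pool(views, query)
print("mean pooling:    ", uniform)
print("adaptive pooling:", pooled)
print("view weights:    ", weights)
```

With a zero `query` the weights become uniform and the result reduces to the mean, so averaging is the special case in which every view is treated as equally informative.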