Vision-and-Language Navigation (VLN) has gained prominence in artificial intelligence research due to its potential applications in fields such as home assistance. Many contemporary VLN approaches, while built on transformer architectures, increasingly incorporate additional components such as external knowledge bases or map information to enhance performance. These additions boost performance but also lead to larger models and higher computational costs. In this paper, to achieve both high performance and low computational cost, we propose a novel architecture, the COmbination of Selective MemOrization (COSMO). Specifically, COSMO integrates state-space modules with transformer modules and incorporates two VLN-customized selective state-space modules: the Round Selective Scan (RSS) and the Cross-modal Selective State Space Module (CS3). RSS facilitates comprehensive inter-modal interactions within a single scan, while CS3 adapts the selective state-space module to a dual-stream architecture, thereby enhancing the capture of cross-modal interactions. Experiments on three mainstream VLN benchmarks, REVERIE, R2R, and R2R-CE, demonstrate that our model achieves competitive navigation performance while significantly reducing computational costs.
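To make the dual-stream, cross-modal selective state-space idea concrete, the sketch below shows one simplified way such a block could be wired: each modality's tokens are processed by an input-dependent (selective) scan whose gates are conditioned on a summary of the other modality. This is an illustrative assumption only, not the authors' RSS or CS3 design; all class names, the pooling-based conditioning, and the diagonal scan form are hypothetical simplifications.

```python
# Illustrative sketch only: a simplified diagonal selective scan in a
# dual-stream, cross-modal layout. Names, shapes, and the conditioning
# scheme are assumptions for exposition, not the paper's exact modules.
import torch
import torch.nn as nn


class SelectiveScan(nn.Module):
    """Simplified selective scan: h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t.

    The gates a_t, b_t, c_t are predicted from a conditioning sequence,
    making the recurrence input-dependent ("selective").
    """

    def __init__(self, dim: int):
        super().__init__()
        self.to_gates = nn.Linear(dim, 3 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x, cond: (batch, seq_len, dim)
        a, b, c = self.to_gates(cond).chunk(3, dim=-1)
        a = torch.sigmoid(a)            # decay/forget gate in (0, 1)
        h = torch.zeros_like(x[:, 0])   # hidden state, (batch, dim)
        outs = []
        for t in range(x.size(1)):      # sequential scan over tokens
            h = a[:, t] * h + b[:, t] * x[:, t]
            outs.append(c[:, t] * h)
        return torch.stack(outs, dim=1)


class DualStreamCrossModalBlock(nn.Module):
    """Dual-stream block: each modality is scanned with gates conditioned on the other."""

    def __init__(self, dim: int):
        super().__init__()
        self.vis_scan = SelectiveScan(dim)
        self.txt_scan = SelectiveScan(dim)
        self.vis_norm = nn.LayerNorm(dim)
        self.txt_norm = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        # Pool each stream into a summary token that conditions the other stream's gates.
        txt_summary = txt.mean(dim=1, keepdim=True).expand(-1, vis.size(1), -1)
        vis_summary = vis.mean(dim=1, keepdim=True).expand(-1, txt.size(1), -1)
        vis = vis + self.vis_scan(self.vis_norm(vis), cond=txt_summary)
        txt = txt + self.txt_scan(self.txt_norm(txt), cond=vis_summary)
        return vis, txt


if __name__ == "__main__":
    block = DualStreamCrossModalBlock(dim=64)
    vis_tokens = torch.randn(2, 36, 64)   # e.g. panoramic view features
    txt_tokens = torch.randn(2, 20, 64)   # e.g. instruction token features
    vis_out, txt_out = block(vis_tokens, txt_tokens)
    print(vis_out.shape, txt_out.shape)   # (2, 36, 64) (2, 20, 64)
```

Compared with cross-attention, which costs quadratic time in sequence length, a scan of this kind runs linearly in the number of tokens, which is the general motivation for mixing state-space modules with transformer modules.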