Most existing works on the Room-to-Room Vision-and-Language Navigation (VLN) problem use only RGB images and do not consider the local context around candidate views, so they lack sufficient visual cues about the surrounding environment. Moreover, natural language carries complex semantic information, so its correlations with visual inputs are hard to model with cross attention alone. In this paper, we propose GeoVLN, which learns a geometry-enhanced visual representation based on slot attention for robust Vision-and-Language Navigation. The RGB images are complemented with the corresponding depth maps and normal maps predicted by Omnidata as visual inputs. Technically, we introduce a two-stage module that combines local slot attention and the CLIP model to produce a geometry-enhanced representation from these inputs. We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information. Additionally, we design a novel multiway attention module that encourages different phrases of the input instruction to exploit the most relevant features from the visual input. Extensive experiments demonstrate the effectiveness of our newly designed modules and show the compelling performance of the proposed method.
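To make the slot attention idea mentioned in the abstract more concrete, the following is a minimal NumPy sketch of one generic slot-attention iteration. It is not the GeoVLN local slot attention module, whose details the abstract does not give. The function names, the random projection matrices, and the toy shapes (9 input features, 3 slots, dimension 16) are assumptions made for illustration, and the learned GRU and MLP update of the original slot attention formulation is replaced here by directly assigning the weighted mean to the slots.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_attention_step(slots, inputs, w_q, w_k, w_v, eps=1e-8):
    """One simplified slot-attention iteration (hypothetical helper).

    slots:  (num_slots, d) current slot vectors
    inputs: (num_inputs, d) input features, e.g. features of nearby views
    Returns the updated slots and the (num_inputs, num_slots) attention map.
    """
    q = slots @ w_q
    k = inputs @ w_k
    v = inputs @ w_v
    logits = k @ q.T / np.sqrt(q.shape[1])
    # Softmax over the slot axis: slots compete to explain each input.
    attn = softmax(logits, axis=1)
    # Normalize over inputs so each slot takes a weighted mean of the values.
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
    # Assumption: direct assignment stands in for the learned GRU/MLP update.
    new_slots = weights.T @ v
    return new_slots, attn


# Toy usage with random data; all sizes are illustrative only.
rng = np.random.default_rng(0)
dim = 16
inputs = rng.normal(size=(9, dim))
slots = rng.normal(size=(3, dim))
w_q, w_k, w_v = (0.1 * rng.normal(size=(dim, dim)) for _ in range(3))

for _ in range(3):
    slots, attn = slot_attention_step(slots, inputs, w_q, w_k, w_v)
```

The distinguishing design choice is the softmax over the slot axis rather than the input axis: each input feature distributes its attention across slots, which differs from standard cross attention, where each query normalizes over the inputs.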