Cross-modal alignment is a key challenge for Vision-and-Language Navigation (VLN). Most existing studies concentrate on mapping the global instruction or a single sub-instruction to the corresponding trajectory. However, fine-grained alignment at the entity level, an equally critical problem, is seldom considered. To address this problem, we propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks. To enable this paradigm, we first add human-annotated grounded entity-landmark pairs to the Room-to-Room (R2R) dataset, yielding GEL-R2R. We then adopt three grounded entity-landmark adaptive pre-training objectives: 1) entity phrase prediction, 2) landmark bounding box prediction, and 3) entity-landmark semantic alignment, which explicitly supervise the learning of fine-grained cross-modal alignment between entity phrases and environment landmarks. Finally, we validate our model on two downstream benchmarks: VLN with descriptive instructions (R2R) and VLN with dialogue instructions (CVDN). Comprehensive experiments show that our GELA model achieves state-of-the-art results on both tasks, demonstrating its effectiveness and generalizability.
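As a rough illustration of how the three objectives named above could be combined, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the loss forms, so each choice here is an assumption: cross-entropy over token positions for entity phrase prediction, L1 plus IoU for landmark bounding box prediction, and an InfoNCE-style contrastive loss for entity-landmark semantic alignment. All function names, the equal weighting, and the temperature are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy of one list of logits against a target index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def phrase_loss(start_logits, end_logits, start, end):
    """Assumed form: predict the start and end token of the entity phrase."""
    return cross_entropy(start_logits, start) + cross_entropy(end_logits, end)


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def box_loss(pred, gold):
    """Assumed form: mean L1 distance plus (1 - IoU) on the landmark box."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gold)) / 4.0
    return l1 + (1.0 - box_iou(pred, gold))


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def alignment_loss(entity_vecs, landmark_vecs, temperature=0.1):
    """Assumed form: InfoNCE where the i-th entity matches the i-th landmark."""
    ents = [_normalize(v) for v in entity_vecs]
    lms = [_normalize(v) for v in landmark_vecs]
    total = 0.0
    for i, e in enumerate(ents):
        # Cosine similarity of entity i to every landmark, scaled by temperature.
        logits = [sum(a * b for a, b in zip(e, l)) / temperature for l in lms]
        total += cross_entropy(logits, i)
    return total / len(ents)


def combined_loss(l_phrase, l_box, l_align, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of the three objectives."""
    return weights[0] * l_phrase + weights[1] * l_box + weights[2] * l_align
```

In this sketch, correctly paired entity and landmark embeddings give a lower alignment loss than mismatched pairs, and a predicted box identical to the annotated one gives a box loss of zero.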