This paper addresses the challenge of fine-grained alignment in Vision-and-Language Navigation (VLN), where robots navigate realistic 3D environments following natural language instructions. Current approaches use contrastive learning to align language with visual trajectory sequences, but they struggle with fine-grained vision negatives. To enrich cross-modal embeddings, we introduce a novel Bayesian-Optimization-based adversarial framework for generating fine-grained contrastive vision samples. To validate the proposed method, we conduct a series of experiments assessing the effectiveness of the enriched embeddings on fine-grained vision negatives. Experiments on two common VLN benchmarks, R2R and REVERIE, demonstrate that these embeddings benefit navigation and yield a promising performance improvement. Our source code and trained models are available at: https://anonymous.4open.science/r/FGVLN.
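To make the idea concrete, the following is a minimal, hypothetical sketch of mining a hard (fine-grained) visual negative with Bayesian Optimization, here using a Gaussian-process surrogate via scikit-optimize. The model interface, the perturbation parameterization, and names such as `perturb_trajectory`, `encode_traj`, `swap_ratio`, and `noise_std` are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch: BO-driven search for hard visual negatives in contrastive VLN pretraining.
import torch
from skopt import gp_minimize
from skopt.space import Real


def similarity(instr_emb: torch.Tensor, traj_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between an instruction embedding and a trajectory embedding."""
    return torch.nn.functional.cosine_similarity(instr_emb, traj_emb, dim=-1)


def perturb_trajectory(traj_feats: torch.Tensor, swap_ratio: float, noise_std: float) -> torch.Tensor:
    """Build a fine-grained negative by lightly corrupting a positive trajectory:
    shuffle a small fraction of view features and add mild Gaussian noise."""
    neg = traj_feats.clone()
    n_steps = neg.size(0)
    n_swap = max(1, int(swap_ratio * n_steps))
    idx = torch.randperm(n_steps)[:n_swap]
    neg[idx] = neg[idx[torch.randperm(n_swap)]]      # permute a few trajectory steps
    neg = neg + noise_std * torch.randn_like(neg)    # small feature-level noise
    return neg


def mine_hard_negative(instr_emb, traj_feats, encode_traj, n_calls=20):
    """Search perturbation parameters whose negative remains maximally similar to the
    instruction (i.e., is maximally hard), using a GP-based Bayesian Optimization loop."""
    def objective(params):
        swap_ratio, noise_std = params
        neg = perturb_trajectory(traj_feats, swap_ratio, noise_std)
        # gp_minimize minimizes, so negate: harder negative = higher similarity.
        return -similarity(instr_emb, encode_traj(neg)).item()

    space = [Real(0.05, 0.5, name="swap_ratio"), Real(0.0, 0.2, name="noise_std")]
    result = gp_minimize(objective, space, n_calls=n_calls, random_state=0)
    best_swap, best_noise = result.x
    return perturb_trajectory(traj_feats, best_swap, best_noise)
```

Under these assumptions, the mined negative would be appended to the contrastive batch so the cross-modal encoder must distinguish the ground-truth trajectory from a subtly corrupted one, rather than from easy, randomly sampled negatives.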