Autonomous navigation for an embodied agent guided by natural language instructions remains a formidable challenge in vision-and-language navigation (VLN). Despite remarkable recent progress in learning fine-grained and multifarious visual representations, the tendency to overfit to the training environments leads to unsatisfactory generalization performance. In this work, we present a versatile Multi-Branch Architecture (MBA) aimed at exploring and exploiting diverse visual inputs. Specifically, we introduce three distinct visual variants: ground-truth depth images, visual inputs integrated with incongruent views, and those infused with random noise, which enrich the diversity of the visual input representation and prevent overfitting to the original RGB observations. To adaptively fuse these varied inputs, the proposed MBA extends a base agent model into a multi-branch variant, where each branch processes a different visual input. Surprisingly, even random noise can further enhance navigation performance in unseen environments. Extensive experiments conducted on three VLN benchmarks (R2R, REVERIE, SOON) demonstrate that our proposed method matches or even surpasses state-of-the-art results. The source code will be made publicly available.
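The multi-branch idea described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: each branch encodes one visual variant (here RGB features, depth features, and a noise-perturbed copy of the RGB features stand in for the three variants), and the branch outputs are fused by softmax-normalized learnable weights. All class names, dimensions, and the specific fusion rule are hypothetical assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MultiBranchFusion:
    """Hypothetical sketch of an MBA-style agent head: one linear
    encoder per visual variant, fused by softmax-normalized weights."""

    def __init__(self, in_dim, out_dim, n_branches=3):
        # One independent encoder (weight matrix) per branch.
        self.weights = [rng.standard_normal((in_dim, out_dim)) * 0.01
                        for _ in range(n_branches)]
        # Learnable fusion logits; softmax makes the gates adaptive.
        self.alpha = np.zeros(n_branches)

    def forward(self, variants):
        # variants: list of (in_dim,) feature vectors, one per branch.
        feats = [v @ W for v, W in zip(variants, self.weights)]
        gates = softmax(self.alpha)
        # Weighted sum of branch outputs -> fused representation.
        return sum(g * f for g, f in zip(gates, feats))

# Stand-ins for the three visual variants.
in_dim, out_dim = 16, 8
rgb = rng.standard_normal(in_dim)            # original RGB features
depth = rng.standard_normal(in_dim)          # depth-image features
noisy = rgb + 0.1 * rng.standard_normal(in_dim)  # noise-infused variant

mba = MultiBranchFusion(in_dim, out_dim)
fused = mba.forward([rgb, depth, noisy])
print(fused.shape)  # (8,)
```

In this sketch the fusion gates are shared across the whole feature vector; a real agent would typically learn them jointly with the rest of the navigation policy.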