Autonomous navigation for an embodied agent guided by natural language instructions remains a formidable challenge in vision-and-language navigation (VLN). Despite remarkable recent progress in learning fine-grained and multifarious visual representations, the tendency to overfit to the training environments leads to unsatisfactory generalization performance. In this work, we present a versatile Multi-Branch Architecture (MBA) aimed at exploring and exploiting diverse visual inputs. Specifically, we introduce three distinct visual variants: ground-truth depth images, visual inputs integrated with incongruent views, and those infused with random noise, which enrich the diversity of the visual input representation and prevent overfitting to the original RGB observations. To adaptively fuse these varied inputs, the proposed MBA extends a base agent model into a multi-branch variant, where each branch processes a different visual input. Surprisingly, even random noise can further enhance navigation performance in unseen environments. Extensive experiments conducted on three VLN benchmarks (R2R, REVERIE, SOON) demonstrate that our proposed method matches or even surpasses state-of-the-art results. The source code will be made publicly available.
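The multi-branch idea described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: each branch encodes one visual variant (here RGB features, depth features, and a noise-perturbed copy of the RGB features stand in for the three variants), and the branch outputs are fused by softmax-normalized learnable weights. All class names, dimensions, and the specific fusion rule are hypothetical assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MultiBranchFusion:
    """Hypothetical sketch of an MBA-style agent head: one linear
    encoder per visual variant, fused by softmax-normalized weights."""

    def __init__(self, in_dim, out_dim, n_branches=3):
        # One independent encoder (weight matrix) per branch.
        self.weights = [rng.standard_normal((in_dim, out_dim)) * 0.01
                        for _ in range(n_branches)]
        # Learnable fusion logits; softmax makes the gates adaptive.
        self.alpha = np.zeros(n_branches)

    def forward(self, variants):
        # variants: list of (in_dim,) feature vectors, one per branch.
        feats = [v @ W for v, W in zip(variants, self.weights)]
        gates = softmax(self.alpha)
        # Weighted sum of branch outputs -> fused representation.
        return sum(g * f for g, f in zip(gates, feats))

# Stand-ins for the three visual variants.
in_dim, out_dim = 16, 8
rgb = rng.standard_normal(in_dim)            # original RGB features
depth = rng.standard_normal(in_dim)          # depth-image features
noisy = rgb + 0.1 * rng.standard_normal(in_dim)  # noise-infused variant

mba = MultiBranchFusion(in_dim, out_dim)
fused = mba.forward([rgb, depth, noisy])
print(fused.shape)  # (8,)
```

In this sketch the fusion gates are shared across the whole feature vector; a real agent would typically learn them jointly with the rest of the navigation policy.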