Language-Based Augmentation to Address Shortcut Learning in Object Goal Navigation

Deep Reinforcement Learning (DRL) has shown great potential in enabling robots to find specific objects (e.g., "find a fridge") in environments such as homes or schools. This task is known as Object-Goal Navigation (ObjectNav). DRL methods are predominantly trained and evaluated using environment simulators. Although DRL has shown impressive results, the simulators may be biased or limited. This creates a risk of shortcut learning, i.e., learning a policy tailored to specific visual details of the training environments. We aim to deepen the understanding of shortcut learning in ObjectNav and its implications, and to propose a solution. We design an experiment that inserts a shortcut bias into the appearance of training environments. As a proof of concept, we associate room types with specific wall colors (e.g., bedrooms with green walls), and observe poor generalization of a state-of-the-art (SOTA) ObjectNav method to environments where this association does not hold (e.g., bedrooms with blue walls). We find that shortcut learning is the root cause: the agent learns to navigate to target objects by simply searching for the wall color associated with the target object's room. To solve this, we propose Language-Based (L-B) augmentation. Our key insight is that we can leverage the multimodal feature space of a Vision-Language Model (VLM) to augment visual representations directly at the feature level, requiring no changes to the simulator and adding only one layer to the model. Whereas the success rate of the SOTA ObjectNav method drops by 69%, that of our method drops by only 23%.
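To make the idea of feature-level augmentation concrete, the following is a minimal sketch of one way it could work; it is not the method as implemented in the paper. It assumes that an image embedding can be shifted along the difference between two text embeddings in a shared vision-language space. The prompts, the function names (`augment_feature`, `normalize`, `cosine`), and the `strength` parameter are hypothetical, and random vectors stand in for real VLM encoder outputs. The single added model layer mentioned above is not shown.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two unit-norm vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def augment_feature(image_feat, text_feats, src, tgt, strength=1.0):
    """Shift an image embedding along the text direction src -> tgt.

    image_feat: unit-norm image embedding (stand-in for a VLM image encoder output).
    text_feats: dict mapping a prompt to its unit-norm text embedding.
    src, tgt:   prompts describing the current and the substituted appearance.
    strength:   hypothetical scale factor for the shift (assumption).
    """
    shifted = [
        f + strength * (t - s)
        for f, s, t in zip(image_feat, text_feats[src], text_feats[tgt])
    ]
    return normalize(shifted)


# Stand-in embeddings; a real pipeline would obtain these from the VLM encoders.
rng = random.Random(0)
DIM = 64
green = "a room with green walls"  # hypothetical prompt
blue = "a room with blue walls"    # hypothetical prompt
text_feats = {
    p: normalize([rng.gauss(0, 1) for _ in range(DIM)]) for p in (green, blue)
}

# Pretend the observation shows green walls: place it near that text embedding.
image_feat = normalize([x + rng.gauss(0, 0.02) for x in text_feats[green]])

# Augment the feature as if the walls were blue, without touching the simulator.
augmented = augment_feature(image_feat, text_feats, src=green, tgt=blue)
```

In this toy setup, `augmented` lies closer to the "blue walls" text embedding than to the "green walls" one, so a policy trained on such features would see varied wall colors even though the rendered images never change. Pure Python lists are used only to keep the sketch dependency-free; array or tensor operations would be the natural choice in practice.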
