Deep Reinforcement Learning (DRL) has shown great potential in enabling robots to find specified objects (e.g., "find a fridge") in environments such as homes or schools. This task is known as Object-Goal Navigation (ObjectNav). DRL methods are predominantly trained and evaluated in environment simulators. Although DRL has produced impressive results, these simulators may be biased or limited. This creates a risk of shortcut learning, i.e., learning a policy tailored to specific visual details of the training environments. We aim to deepen the understanding of shortcut learning in ObjectNav and its implications, and to propose a solution. We design an experiment that inserts a shortcut bias into the appearance of the training environments. As a proof of concept, we associate room types with specific wall colors (e.g., bedrooms with green walls) and observe that a state-of-the-art (SOTA) ObjectNav method generalizes poorly to environments where this association does not hold (e.g., bedrooms with blue walls). We find that shortcut learning is the root cause: the agent learns to navigate to target objects by simply searching for the wall color associated with the target object's room. To solve this, we propose Language-Based (L-B) augmentation. Our key insight is that we can leverage the multimodal feature space of a Vision-Language Model (VLM) to augment visual representations directly at the feature level, requiring no changes to the simulator and the addition of only one layer to the model. Where the success rate of the SOTA ObjectNav method drops by 69%, that of our proposal drops by only 23%.
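To make the idea of feature-level augmentation concrete, the following is a minimal sketch and not the method as implemented in this work. It assumes that a VLM maps images and text into a shared, L2-normalized embedding space, and that the difference between two text embeddings (e.g., "a green wall" and "a blue wall") approximates the direction in which the corresponding visual attribute changes. The function name `language_based_augment`, the `strength` parameter, and the random vectors standing in for real VLM embeddings are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale a vector to unit length, as VLM embeddings commonly are."""
    return x / (np.linalg.norm(x) + eps)


def language_based_augment(image_feat, text_feats, src_idx, rng, strength=1.0):
    """Shift an image feature along a text-derived direction (hypothetical helper).

    image_feat: (d,) image embedding from the VLM.
    text_feats: (k, d) text embeddings of k attribute descriptions.
    src_idx:    index of the description matching the current image.
    Returns the augmented feature and the index of the sampled target description.
    """
    # Sample a target description different from the source one.
    candidates = [i for i in range(len(text_feats)) if i != src_idx]
    dst_idx = int(rng.choice(candidates))
    # Assumption: the text-space difference approximates the visual change.
    direction = text_feats[dst_idx] - text_feats[src_idx]
    return l2_normalize(image_feat + strength * direction), dst_idx


# Stand-ins for real VLM embeddings (assumption: random unit vectors).
rng = np.random.default_rng(0)
dim, num_colors = 512, 4
text_feats = np.stack([l2_normalize(v) for v in rng.normal(size=(num_colors, dim))])
# An "image" whose embedding lies close to description 0 (e.g., a green wall).
image_feat = l2_normalize(text_feats[0] + 0.01 * rng.normal(size=dim))

aug_feat, dst_idx = language_based_augment(image_feat, text_feats, src_idx=0, rng=rng)
print("similarity to source description:", float(aug_feat @ text_feats[0]))
print("similarity to target description:", float(aug_feat @ text_feats[dst_idx]))
```

In this sketch the augmentation is a single parameter-free operation applied to the encoder output, which is one plausible reading of "one layer"; a real pipeline would replace the random vectors with embeddings produced by the VLM's image and text encoders and apply the shift only during training.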