Visual place classification from a first-person-view monocular RGB image is a fundamental problem in long-term robot navigation. A key difficulty is that RGB image classifiers are often vulnerable to spatial and appearance changes and degrade under domain shifts such as seasonal, weather, and lighting differences. To address this issue, multi-sensor fusion approaches that combine RGB with depth (D) measurements (e.g., LIDAR, radar, stereo) have gained popularity in recent years. Inspired by these efforts in multimodal RGB-D fusion, we explore the use of pseudo-depth measurements, obtained from recently developed "domain-invariant" monocular depth estimation techniques, as an additional modality, reformulating the single-modal RGB image classification task as a pseudo multi-modal RGB-D classification problem. Specifically, we describe a practical, fully self-supervised framework for training, appropriately processing, fusing, and classifying these two modalities, RGB and pseudo-D. Experiments on challenging cross-domain scenarios using the public NCLT dataset validate the effectiveness of the proposed framework.