Visual place classification from a first-person-view monocular RGB image is a fundamental problem in long-term robot navigation. The difficulty is that RGB image classifiers are often vulnerable to spatial and appearance changes, and their accuracy degrades under domain shifts such as seasonal, weather, and lighting differences. To address this issue, multi-sensor fusion approaches that combine RGB with depth (D) from sensors such as LIDAR, radar, and stereo cameras have gained popularity in recent years. Inspired by these efforts in multimodal RGB-D fusion, we explore the use of pseudo-depth measurements, obtained from recently developed "domain-invariant" monocular depth estimation techniques, as an additional modality. This reformulates the single-modal RGB image classification task as a pseudo multi-modal RGB-D classification problem. Specifically, we describe a practical, fully self-supervised framework for training, appropriately processing, fusing, and classifying the two modalities, RGB and pseudo-D. Experiments on challenging cross-domain scenarios using the public NCLT dataset validate the effectiveness of the proposed framework.
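The reformulation described above can be illustrated with a minimal sketch. It assumes late fusion by a weighted average of per-modality class probabilities, which is an illustrative choice and not necessarily the fusion scheme used in the proposed framework. All names here (`classify_place`, `fuse_late`, `depth_estimator`, `rgb_classifier`, `depth_classifier`, `w_rgb`) are hypothetical placeholders; the three models are passed in as plain callables.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fuse_late(rgb_logits, depth_logits, w_rgb=0.5):
    """Weighted average of the two modalities' class probabilities.

    Assumption: simple late fusion, chosen only for illustration.
    """
    p_rgb = softmax(np.asarray(rgb_logits, dtype=float))
    p_depth = softmax(np.asarray(depth_logits, dtype=float))
    return w_rgb * p_rgb + (1.0 - w_rgb) * p_depth


def classify_place(rgb_image, depth_estimator, rgb_classifier,
                   depth_classifier, w_rgb=0.5):
    """Classify a place from a single RGB image via pseudo RGB-D fusion.

    depth_estimator:  callable, RGB image -> pseudo-depth map (hypothetical)
    rgb_classifier:   callable, RGB image -> class logits (hypothetical)
    depth_classifier: callable, pseudo-depth map -> class logits (hypothetical)
    """
    # Step 1: derive the second modality from the same RGB input.
    pseudo_depth = depth_estimator(rgb_image)
    # Step 2: classify each modality independently.
    rgb_logits = rgb_classifier(rgb_image)
    depth_logits = depth_classifier(pseudo_depth)
    # Step 3: fuse the two predictions and pick the most likely place class.
    fused = fuse_late(rgb_logits, depth_logits, w_rgb)
    return int(np.argmax(fused)), fused
```

Only one physical sensor is involved in this sketch: the depth branch consumes the output of the monocular depth estimator, so no additional sensor is needed when the system is deployed.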