Semantic localization, i.e., robot self-localization using the semantic image modality, is critical in recently emerging embodied AI applications (e.g., point-goal navigation, object-goal navigation, vision-language navigation) and topological mapping applications (e.g., graph neural SLAM, egocentric topological maps). However, most existing works on semantic localization focus on passive vision tasks without viewpoint planning, or rely on additional rich modalities (e.g., depth measurements). As a result, the problem remains largely unsolved. In this work, we explore a lightweight, entirely CPU-based, domain-adaptive semantic localization framework, called the graph neural localizer. Our approach is inspired by two recently emerging technologies: (1) scene graphs, which combine the viewpoint and appearance invariance of local and global features; and (2) graph neural networks, which enable direct learning and recognition of graph data (i.e., non-vector data). Specifically, a graph convolutional neural network is first trained as a scene graph classifier for passive vision, and its knowledge is then transferred to a reinforcement-learning planner for active vision. Experiments on two scenarios, self-supervised learning and unsupervised domain adaptation, using the photo-realistic Habitat simulator validate the effectiveness of the proposed method.
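
To make the passive-vision step more concrete, the following is a minimal sketch of how a graph convolutional network could map a scene graph to a distribution over place classes. It is not the architecture used in this work: the one-hot node labels, the single graph convolution layer, the mean-pool readout, the layer sizes, and the random untrained weights are all assumptions chosen for illustration, and every function name is hypothetical. It uses only the Python standard library, in keeping with the CPU-only setting.

```python
import math
import random


def normalized_adjacency(num_nodes, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected graph (dense lists)."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0  # self-loops keep each node's own feature
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


def matmul(x, y):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def gcn_layer(a_hat, h, w):
    """One graph convolution: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_scene_graph(node_features, edges, w1, w2):
    """Graph convolution, mean-pool readout, then a linear layer and softmax."""
    a_hat = normalized_adjacency(len(node_features), edges)
    h = gcn_layer(a_hat, node_features, w1)
    pooled = [sum(col) / len(h) for col in zip(*h)]  # graph-level embedding
    logits = matmul([pooled], w2)[0]
    return softmax(logits)


def random_matrix(rows, cols, rng):
    """Untrained placeholder weights; a real model would learn these."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Toy scene graph (assumed): 4 image regions with one-hot semantic labels.
node_features = [[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0],
                 [0.0, 0.0, 0.0, 1.0]]
edges = [(0, 1), (1, 2), (1, 3)]  # adjacency between regions
w1 = random_matrix(4, 8, rng)     # 4 label dims -> 8 hidden dims
w2 = random_matrix(8, 3, rng)     # 8 hidden dims -> 3 place classes
place_probs = classify_scene_graph(node_features, edges, w1, w2)
print(place_probs)
```

In a sketch like this, the class-probability vector is one plausible quantity that a reinforcement-learning planner could consume as its state when choosing the next viewpoint. How knowledge is actually transferred from the classifier to the planner is not specified in the abstract.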