Semantic localization, i.e., robot self-localization using the semantic image modality, is critical in recently emerging embodied AI applications such as point-goal navigation, object-goal navigation, and vision-language navigation. However, most existing works on semantic localization focus on passive vision tasks without viewpoint planning, or rely on additional rich modalities (e.g., depth measurements). Thus, the problem is largely unsolved. In this work, we explore a lightweight, entirely CPU-based, domain-adaptive semantic localization framework, called the graph neural localizer. Our approach is inspired by two recently emerging technologies: (1) the scene graph, which combines the viewpoint and appearance invariance of local and global features; and (2) the graph neural network, which enables direct learning and recognition of graph data (i.e., non-vector data). Specifically, a graph convolutional neural network is first trained as a scene graph classifier for passive vision, and its knowledge is then transferred to a reinforcement-learning planner for active vision. Experiments on two scenarios, self-supervised learning and unsupervised domain adaptation, using the photo-realistic Habitat simulator validate the effectiveness of the proposed method.
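To make the passive-vision stage more concrete, the following is a minimal sketch of how a graph convolutional network could classify a scene graph into place classes. It is not the authors' implementation: the function names, the toy graph, the one-hot semantic node features, the layer sizes, and the number of place classes are all assumptions chosen for illustration, and the weights are random rather than trained. The layer uses the common symmetric-normalized propagation rule ReLU(Â H W) with self-loops, followed by mean pooling and a softmax. The abstract does not specify how the classifier's knowledge is transferred to the reinforcement-learning planner, so that step is not sketched.

```python
import math
import random


def normalized_adjacency(num_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected graph (dense list of lists)."""
    a = [[1.0 if i == j else 0.0 for j in range(num_nodes)] for i in range(num_nodes)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


def matmul(x, y):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [[sum(xi * yi for xi, yi in zip(row, col)) for col in zip(*y)] for row in x]


def gcn_layer(a_hat, h, w):
    """One graph convolution: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


def classify_scene_graph(a_hat, features, w1, w2):
    """GCN layer -> mean pooling over nodes -> linear layer -> softmax over place classes."""
    h = gcn_layer(a_hat, features, w1)
    pooled = [sum(col) / len(h) for col in zip(*h)]
    logits = matmul([pooled], w2)[0]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scene graph (hypothetical): 4 image regions, 3 semantic labels, chain-like adjacency.
features = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]  # one-hot semantic label per node
edges = [(0, 1), (1, 2), (2, 3)]
a_hat = normalized_adjacency(len(features), edges)

# Random, untrained weights: 3 input features -> 8 hidden units -> 5 place classes.
rng = random.Random(0)
w1 = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
w2 = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(8)]

place_probs = classify_scene_graph(a_hat, features, w1, w2)
print(place_probs)
```

The output is a probability vector over the assumed place classes. In a trained system the weights would be learned from labeled scene graphs, and a sparse representation would replace the dense matrices for larger graphs.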