Semantic localization, i.e., robot self-localization using the semantic image modality, is critical in recently emerging embodied AI applications such as point-goal navigation, object-goal navigation, and vision-language navigation. However, most existing works on semantic localization focus on passive vision tasks without viewpoint planning, or rely on additional rich modalities (e.g., depth measurements). Thus, the problem is largely unsolved. In this work, we explore a lightweight, entirely CPU-based, domain-adaptive semantic localization framework, called the graph neural localizer. Our approach is inspired by two recently emerging technologies: (1) the scene graph, which combines the viewpoint and appearance invariance of local and global features; and (2) the graph neural network, which enables direct learning and recognition of graph data (i.e., non-vector data). Specifically, a graph convolutional neural network is first trained as a scene graph classifier for passive vision, and then its knowledge is transferred to a reinforcement-learning planner for active vision. Experiments on two scenarios, self-supervised learning and unsupervised domain adaptation, using the photorealistic Habitat simulator validate the effectiveness of the proposed method.
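To make the passive-vision stage concrete, the following is a minimal sketch of a scene graph classifier built from a single graph convolution, written with the Python standard library only. It is not the implementation used in this work: the symmetric normalization (in the style of Kipf and Welling), the mean-pool readout, the one-hot semantic-label node features, the layer sizes, and all names (`normalize_adjacency`, `gcn_layer`, `classify_scene_graph`) are assumptions made for illustration, and the weights are random rather than trained.

```python
import math
import random


def matmul(a, b):
    """Dense matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}; self-loops keep every degree >= 1."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(a_norm, h, w):
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    z = matmul(matmul(a_norm, h), w)
    return [[max(0.0, v) for v in row] for row in z]


def classify_scene_graph(adj, feats, w1, w2):
    """Map a scene graph to a probability distribution over place classes."""
    h = gcn_layer(normalize_adjacency(adj), feats, w1)
    pooled = [sum(col) / len(h) for col in zip(*h)]  # mean-pool node embeddings
    logits = matmul([pooled], w2)[0]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy scene graph: 3 image regions, 4 semantic labels, 3 place classes.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
feats = [[1.0, 0.0, 0.0, 0.0],   # one-hot semantic label per node (assumed encoding)
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0, 0.0]]

rng = random.Random(0)
w1 = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(4)]  # 4 labels -> 5 hidden
w2 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]  # 5 hidden -> 3 places

probs = classify_scene_graph(adj, feats, w1, w2)
print(probs)
```

The output is a distribution over place classes for one view. How such an output is passed to the reinforcement-learning planner is not specified in this abstract and is not shown here.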