Learning-based visual object navigation is one of the key tasks in mobile robotics. This paper introduces a new representation of the scene semantic map that is built while an embodied agent interacts with an indoor environment. The representation is based on a neural network method that adjusts the weights of the segmentation model by backpropagating predicted fusion loss values during inference on a regular (backward) or delayed (forward) image sequence. We have integrated this representation into a full-fledged navigation approach called SkillTron, which selects robot skills from among end-to-end policies based on reinforcement learning and classic map-based planning methods. The proposed approach makes it possible to form both intermediate goals for robot exploration and the final goal for object navigation. We conducted extensive experiments with the proposed approach in the Habitat environment, which showed that it significantly outperforms state-of-the-art approaches on navigation quality metrics. The code and the custom datasets used are publicly available at github.com/AIRI-Institute/skill-fusion.
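The skill selection mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how SkillTron chooses between skills, so everything here is an assumption made for illustration: the `Skill` class, the `score` functions, the observation keys `goal_confidence` and `explored_fraction`, and the rule of picking the highest-scoring skill are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical observation format: a flat dict of scalar features.
Observation = Dict[str, float]


@dataclass
class Skill:
    """A navigation skill: a learned policy or a classic planner (illustrative)."""
    name: str
    score: Callable[[Observation], float]  # assumed usefulness estimate
    act: Callable[[Observation], str]      # returns a discrete action name


def select_skill(skills: List[Skill], obs: Observation) -> Skill:
    """Return the skill whose score is highest for this observation."""
    return max(skills, key=lambda s: s.score(obs))


# Two toy skills standing in for an RL goal-reaching policy and a
# map-based exploration planner. Scores and actions are placeholders.
rl_goal_reach = Skill(
    name="rl_goal_reach",
    score=lambda obs: obs["goal_confidence"],
    act=lambda obs: "move_forward",
)
map_explore = Skill(
    name="map_explore",
    score=lambda obs: 1.0 - obs["explored_fraction"],
    act=lambda obs: "turn_left",
)
SKILLS = [rl_goal_reach, map_explore]


def step(obs: Observation) -> Tuple[str, str]:
    """Pick a skill for the current observation and return (skill, action)."""
    skill = select_skill(SKILLS, obs)
    return skill.name, skill.act(obs)
```

In this sketch, calling `step({"goal_confidence": 0.9, "explored_fraction": 0.5})` hands control to the goal-reaching skill, while a low goal confidence in a mostly unexplored scene favors the exploration skill. Re-evaluating the scores at every step lets the agent switch skills as the semantic map changes.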