In this paper, we introduce RFSG, a novel image-goal navigation approach. Our focus lies in leveraging the fine-grained connections among goals, observations, and the environment within limited image data, while keeping the navigation architecture simple and lightweight. To this end, we propose a spatial-channel attention mechanism that enables the network to learn the importance of multi-dimensional features when fusing goal and observation features. In addition, a self-distillation mechanism is incorporated to further enhance feature representation. Since navigation benefits from surrounding environmental information, we propose an image scene graph that establishes feature associations at both the image and object levels, effectively encoding the surrounding scene. Cross-scene performance validation was conducted on the Gibson and HM3D datasets, where the proposed method achieved state-of-the-art results among mainstream methods, running at up to 53.5 frames per second on an RTX 3080. This contributes to the realization of end-to-end image-goal navigation in real-world scenarios. The implementation and models of our method have been released at: https://github.com/nubot-nudt/RFSG.
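To make the idea of spatial-channel attention for goal-observation fusion concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the additive fusion, the sigmoid gating, and the function names are illustrative assumptions; the paper's actual mechanism should be taken from the released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_channel_attention(goal_feat, obs_feat):
    """Fuse goal and observation feature maps (shape (C, H, W)) and
    reweight the result along both the channel and spatial dimensions.
    Simplified sketch; the real method may use learned projections."""
    fused = goal_feat + obs_feat                 # additive fusion (assumption)
    # Channel attention: global average pooling -> one weight per channel.
    chan_w = sigmoid(fused.mean(axis=(1, 2)))    # shape (C,)
    # Spatial attention: channel-wise mean -> one weight per location.
    spat_w = sigmoid(fused.mean(axis=0))         # shape (H, W)
    # Apply both attention maps to the fused features.
    return fused * chan_w[:, None, None] * spat_w[None, :, :]

C, H, W = 8, 4, 4
out = spatial_channel_attention(np.ones((C, H, W)), np.zeros((C, H, W)))
print(out.shape)  # (8, 4, 4)
```

The two attention maps are cheap to compute (a mean and a sigmoid each), which is consistent with the paper's goal of keeping the architecture lightweight.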