Visual-audio navigation (VAN) is attracting growing attention from the robotics community due to its broad applications, \emph{e.g.}, household robots and rescue robots. In this task, an embodied agent must search for and navigate to a sound source using egocentric visual and audio observations. However, existing methods are limited in two respects: 1) poor generalization to unheard sound categories; 2) sample inefficiency in training. To address these two problems, we propose a brain-inspired plug-and-play method that learns a semantic-agnostic and spatial-aware representation for generalizable visual-audio navigation. We design two auxiliary tasks, one for each of these characteristics, to accelerate learning of the representation. With these two auxiliary tasks, the agent learns a spatially correlated representation of visual and audio inputs that transfers to environments with novel sounds and maps. Experimental results on realistic 3D scenes (Replica and Matterport3D) demonstrate that our method achieves better generalization performance under zero-shot transfer to scenes with unseen maps and unheard sound categories.