Visual-audio navigation (VAN) is attracting growing attention from the robotics community because of its broad applications, \emph{e.g.}, household robots and rescue robots. In this task, an embodied agent must search for and navigate to a sound source using egocentric visual and audio observations. However, existing methods are limited in two respects: 1) poor generalization to unheard sound categories; 2) sample inefficiency in training. To address these two problems, we propose a brain-inspired plug-and-play method that learns a semantic-agnostic and spatial-aware representation for generalizable visual-audio navigation. We design two auxiliary tasks, one for each of these two properties, to accelerate learning of such a representation. With these auxiliary tasks, the agent learns a spatially correlated representation of visual and audio inputs that transfers to environments with novel sounds and maps. Experimental results on realistic 3D scenes (Replica and Matterport3D) demonstrate that our method achieves better generalization performance when transferred zero-shot to scenes with unseen maps and unheard sound categories.