Sim2real transfer has received increasing attention due to the success of learning robotic tasks end-to-end in simulation. While there has been much progress in transferring vision-based navigation policies, the existing sim2real strategy for audio-visual navigation applies data augmentation empirically without measuring the acoustic gap. Sound differs from light in that it spans a much wider range of frequencies and thus requires a different solution for sim2real. We propose the first treatment of sim2real for audio-visual navigation by disentangling the task into acoustic field prediction (AFP) and waypoint navigation. We first validate this design choice in the SoundSpaces simulator and show improvement on the Continuous AudioGoal navigation benchmark. We then collect real-world data and measure the spectral difference between simulation and the real world by training AFP models that each take only a specific frequency subband as input. We further propose a frequency-adaptive strategy that selects the best frequency band for prediction based on both the measured spectral difference and the energy distribution of the received audio, which improves performance on the real data. Lastly, we build a real robot platform and show that the transferred policy can successfully navigate to sounding objects. This work demonstrates the potential of building intelligent agents that can see, hear, and act entirely in simulation and then transferring them to the real world.
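To make the frequency-adaptive idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that a per-band sim-to-real gap has already been measured (for example, as the real-data error of each subband AFP model) and combines it with the energy distribution of the received audio. The function names `band_energies` and `select_band`, the `sim2real_gap` input, and the energy-over-gap scoring rule are all hypothetical choices made for illustration.

```python
import numpy as np


def band_energies(audio, sample_rate, band_edges_hz):
    """Return the fraction of spectral energy falling in each [lo, hi) band."""
    spectrum = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    energies = np.array([
        spectrum[(freqs >= lo) & (freqs < hi)].sum()
        for lo, hi in band_edges_hz
    ])
    total = energies.sum()
    if total <= 0:
        # Silent input: fall back to a uniform distribution over bands.
        return np.full(len(band_edges_hz), 1.0 / len(band_edges_hz))
    return energies / total


def select_band(audio, sample_rate, band_edges_hz, sim2real_gap):
    """Pick the band index with high received energy and a small measured gap.

    `sim2real_gap` holds one non-negative number per band (assumed to be
    measured offline). The scoring rule below is an illustrative assumption.
    """
    energy = band_energies(audio, sample_rate, band_edges_hz)
    gap = np.asarray(sim2real_gap, dtype=float)
    score = energy / (gap + 1e-8)
    return int(np.argmax(score))


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    bands = [(0, 500), (500, 2000), (2000, 8000)]
    # A source with energy at 1 kHz and 4 kHz.
    audio = np.sin(2 * np.pi * 1000 * t) + 0.8 * np.sin(2 * np.pi * 4000 * t)
    # Hypothetical gaps: the middle band transfers poorly in this example.
    print(select_band(audio, sr, bands, sim2real_gap=[0.1, 0.5, 0.1]))
```

In this sketch, a band that carries most of the received energy can still be passed over when its measured gap is large, which is the trade-off the abstract describes between spectral difference and energy distribution.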