Developing embodied agents in simulation has been a key research topic in recent years. Exciting new tasks, algorithms, and benchmarks have been developed in various simulators. However, most of this work assumes deaf agents in silent environments, whereas humans perceive the world with multiple senses. We introduce Sonicverse, a multisensory simulation platform with integrated audio-visual simulation for training household agents that can both see and hear. Sonicverse models realistic continuous audio rendering in 3D environments in real time. Together with a new audio-visual VR interface that allows humans to interact with agents through audio, Sonicverse enables a series of embodied AI tasks that require audio-visual perception. For semantic audio-visual navigation in particular, we also propose a new multi-task learning model that achieves state-of-the-art performance. In addition, we demonstrate Sonicverse's realism via sim-to-real transfer, which has not been achieved by other simulators: an agent trained in Sonicverse can successfully perform audio-visual navigation in real-world environments. Sonicverse is available at: https://github.com/StanfordVL/Sonicverse.