Reverberation not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a human speaker reveals important cues about the room geometry, materials, and speaker location, all of which influence the precise reverberation effects. We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed monaural sound and visual scene. In support of this new task, we develop a large-scale dataset SoundSpaces-Speech that uses realistic acoustic renderings of speech in real-world 3D scans of homes offering a variety of room acoustics. Demonstrating our approach on both simulated and real imagery for speech enhancement, speech recognition, and speaker identification, we show it achieves state-of-the-art performance and substantially improves over audio-only methods.