We present AdVerb, a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio. Although audio-only dereverberation is a well-studied problem, our approach additionally exploits the complementary visual modality. Given an image of the environment in which the reverberant sound signal was recorded, AdVerb employs a novel geometry-aware cross-modal transformer architecture that captures scene geometry and audio-visual cross-modal relationships to generate a complex ideal ratio mask, which, when applied to the reverberant audio, predicts the clean sound. The effectiveness of our method is demonstrated through extensive quantitative and qualitative evaluations. Our approach significantly outperforms traditional audio-only and audio-visual baselines on three downstream tasks: speech enhancement, speech recognition, and speaker verification, with relative improvements in the range of 18% to 82% on the LibriSpeech test-clean set. We also achieve highly satisfactory RT60 error scores on the AVSpeech dataset.
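To make the masking step concrete, the following is a minimal sketch of how a complex ratio mask acts on a reverberant spectrogram. It is not the AdVerb model: in the framework the mask is predicted by the transformer from audio and image, whereas here an oracle mask is computed from a known clean signal purely for illustration. The helper names (`stft`, `ideal_complex_ratio_mask`, `apply_mask`), the STFT parameters, and the synthetic exponentially decaying impulse response are all assumptions introduced for this example.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Naive Hann-windowed STFT; returns a (frames, bins) complex array."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def ideal_complex_ratio_mask(clean_spec, reverb_spec, eps=1e-12):
    """Oracle complex mask M with M * reverb_spec ~= clean_spec.

    Computed as S * conj(Y) / (|Y|^2 + eps), i.e. a regularized S / Y.
    """
    return clean_spec * np.conj(reverb_spec) / (np.abs(reverb_spec) ** 2 + eps)

def apply_mask(mask, reverb_spec):
    """Element-wise complex multiplication: adjusts magnitude and phase."""
    return mask * reverb_spec

# Synthetic data (assumption): white noise as "clean" audio and a decaying
# random impulse response standing in for a real room response.
rng = np.random.default_rng(0)
clean = rng.standard_normal(4000)
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)
rir[0] = 1.0  # direct path
reverb = np.convolve(clean, rir)[:len(clean)]

S = stft(clean)    # clean spectrogram
Y = stft(reverb)   # reverberant spectrogram
M = ideal_complex_ratio_mask(S, Y)
S_hat = apply_mask(M, Y)  # estimate of the clean spectrogram
```

Because the mask is complex-valued, it can correct both the magnitude and the phase of each time-frequency bin, which a real-valued magnitude mask cannot do. With the oracle mask the estimate `S_hat` matches `S` up to the small regularization term `eps`; a learned mask would only approximate it.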