Speech enhancement systems are typically trained using pairs of clean and noisy speech. In audio-visual speech enhancement (AVSE), less ground-truth clean data is available; most audio-visual datasets are collected in real-world environments with background noise and reverberation, which hampers the development of AVSE. In this work, we introduce AV2Wav, a resynthesis-based audio-visual speech enhancement approach that can generate clean speech despite the challenges of real-world training data. We obtain a subset of nearly clean speech from an audio-visual corpus using a neural quality estimator, and then train a diffusion model on this subset to generate waveforms conditioned on continuous speech representations from AV-HuBERT with noise-robust training. We use continuous rather than discrete representations to retain prosody and speaker information. Trained on this vocoding task alone, the model already performs speech enhancement better than a masking-based baseline. We further fine-tune the diffusion model on clean/noisy utterance pairs to improve performance. Our approach outperforms a masking-based baseline on both automatic metrics and a human listening test, and is close in quality to the target speech in the listening test. Audio samples can be found at https://home.ttic.edu/~jcchou/demo/avse/avse_demo.html.
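The subset-selection step described above can be illustrated with a minimal sketch. The abstract does not specify the quality estimator's interface, its score scale, or the cutoff used, so `estimate_quality`, `threshold`, and the `Utterance` record are hypothetical placeholders, and the toy scores stand in for a real neural estimator.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Utterance:
    # Hypothetical record for one audio-visual clip; the field names are assumptions.
    utt_id: str
    audio_path: str


def select_near_clean(
    utterances: Iterable[Utterance],
    estimate_quality: Callable[[Utterance], float],
    threshold: float,
) -> List[Utterance]:
    """Keep utterances whose estimated quality score is at or above `threshold`.

    `estimate_quality` stands in for a neural quality estimator; its score
    scale and the threshold value are assumptions, not taken from the paper.
    """
    return [u for u in utterances if estimate_quality(u) >= threshold]


# Toy scores replace a real estimator so the sketch runs on its own.
toy_scores = {"a": 4.2, "b": 2.1, "c": 3.6}
corpus = [Utterance(utt_id=k, audio_path=f"{k}.wav") for k in toy_scores]
subset = select_near_clean(corpus, lambda u: toy_scores[u.utt_id], threshold=3.5)
print([u.utt_id for u in subset])  # ['a', 'c']
```

In this sketch the filtered `subset` would play the role of the nearly clean training data for the waveform generator; how strict the cutoff should be is a trade-off between data quantity and cleanliness that the abstract leaves open.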