Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use spectral mapping/masking to reproduce the clean audio, which often amounts to adding a visual backbone to an existing speech enhancement architecture. In this work, we propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture, and then converts them into waveform audio using a neural vocoder (HiFi-GAN). We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference. Our experiments show that LA-VocE outperforms existing methods according to multiple metrics, particularly under very noisy scenarios.
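The abstract does not specify how the different levels of background noise and speech interference are produced. A common convention, assumed here and not taken from the paper, is to scale the interfering signal so that the mixture reaches a target signal-to-noise ratio (SNR) in decibels. The sketch below illustrates that convention with toy signals; the function names `rms`, `mix_at_snr`, and `measured_snr_db` are hypothetical helpers, not part of LA-VocE.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then add it.

    Returns the noisy mixture and the scaled noise (hypothetical helper).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        raise ValueError("noise is silent, so the SNR is undefined")
    # SNR_dB = 20 * log10(rms(speech) / rms(scaled_noise)), solved for the noise RMS.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measured_snr_db(speech, noise):
    """SNR in dB between a clean signal and an additive noise signal."""
    return 20 * math.log10(rms(speech) / rms(noise))


# Toy stand-ins: a 220 Hz tone as "speech" and Gaussian noise as interference,
# one second each at an assumed 16 kHz sample rate.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# A negative SNR means the interference is louder than the speech.
mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=-5.0)
```

Sweeping `snr_db` from positive to negative values gives progressively harder conditions, which is one way the "very noisy scenarios" mentioned above could be constructed for evaluation.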