Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech with the help of additional visual information, such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes incorporating ultrasound tongue images to further improve the performance of lip-based AV-SE systems. Because ultrasound tongue images are difficult to acquire during inference, we first propose employing knowledge distillation during training to investigate the feasibility of leveraging tongue-related information without directly inputting ultrasound tongue images. Specifically, we guide an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model, thereby transferring tongue-related knowledge to the student. To better model the alignment between the lip and tongue modalities, we further propose introducing a lip-tongue key-value memory network into the AV-SE model. This network retrieves tongue features from readily available lip features, and the retrieved features then assist the subsequent speech enhancement task. Experimental results demonstrate that both methods significantly improve the quality and intelligibility of the enhanced speech compared to traditional lip-based AV-SE baselines. Moreover, both proposed methods generalize well to unseen speakers and to unseen noises. Furthermore, phone error rate (PER) analysis of automatic speech recognition (ASR) reveals that while all phonemes benefit from introducing ultrasound tongue images, palatal and velar consonants benefit most.
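The two mechanisms named in the abstract, retrieving tongue features from lip features through a key-value memory and distilling a teacher's features into a student, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `retrieve_tongue_features` and `feature_distillation_loss`, the cosine-similarity addressing, the `temperature` parameter, the mean-squared-error distillation objective, and all tensor shapes are assumptions chosen for illustration, and the memories here are random rather than learned.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def retrieve_tongue_features(lip_feats, key_memory, value_memory, temperature=1.0):
    """Soft key-value lookup (hypothetical sketch).

    lip_feats:    (T, D_k) per-frame lip features used as queries
    key_memory:   (N, D_k) memory slots living in the lip feature space
    value_memory: (N, D_v) memory slots living in the tongue feature space
    Returns the retrieved tongue features (T, D_v) and the weights (T, N).
    """
    eps = 1e-8
    # Assumption: slots are addressed by cosine similarity.
    q = lip_feats / (np.linalg.norm(lip_feats, axis=-1, keepdims=True) + eps)
    k = key_memory / (np.linalg.norm(key_memory, axis=-1, keepdims=True) + eps)
    scores = (q @ k.T) / temperature          # (T, N)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ value_memory, weights    # weighted sum of value slots


def feature_distillation_loss(student_feats, teacher_feats):
    """Assumption: MSE between student and (frozen) teacher features."""
    return float(np.mean((student_feats - teacher_feats) ** 2))


# Toy usage with random data: 5 frames, 8 memory slots.
rng = np.random.default_rng(0)
lip = rng.normal(size=(5, 16))
keys = rng.normal(size=(8, 16))
values = rng.normal(size=(8, 32))

tongue_hat, attn = retrieve_tongue_features(lip, keys, values)
loss = feature_distillation_loss(tongue_hat, rng.normal(size=(5, 32)))
```

In this sketch the retrieved `tongue_hat` stands in for the tongue stream at inference time, so only audio and lip inputs would be needed; in a trained system the key and value memories would be learned jointly with the enhancement model rather than sampled at random.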