Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech using additional visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes additionally incorporating ultrasound tongue images to improve the performance of lip-based AV-SE systems. Because ultrasound tongue images are difficult to acquire at inference time, knowledge distillation is employed during training so that an audio-lip speech enhancement student model can learn from a pre-trained audio-lip-tongue speech enhancement teacher model. Experimental results show that the proposed method significantly improves the quality and intelligibility of the enhanced speech compared with conventional audio-lip speech enhancement baselines. Further analysis using the phone error rate (PER) of automatic speech recognition (ASR) shows that palatal and velar consonants benefit most from the introduction of ultrasound tongue images.
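The teacher-student training described in the abstract can be illustrated with a minimal sketch of a distillation objective. The abstract does not specify the loss, so everything here is an assumption made for illustration: the function names `mse` and `distillation_loss`, the use of mean squared error for both terms, the choice to match intermediate features, and the weighting parameter `alpha` are all hypothetical and may differ from the paper's actual formulation.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_out: Sequence[float],   # speech estimated by the audio-lip student
    clean_target: Sequence[float],  # clean reference speech
    student_feat: Sequence[float],  # student intermediate features (assumed)
    teacher_feat: Sequence[float],  # audio-lip-tongue teacher features (assumed)
    alpha: float = 0.5,             # hypothetical weight of the distillation term
) -> float:
    """Weighted sum of an enhancement loss and a feature-matching loss."""
    task = mse(student_out, clean_target)      # how well the student enhances speech
    distill = mse(student_feat, teacher_feat)  # how closely it mimics the teacher
    return (1.0 - alpha) * task + alpha * distill


# The student output matches the target, but its features differ from the teacher's.
loss = distillation_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [1.0, 1.0], alpha=0.5)
print(loss)
```

In this sketch the teacher is only consulted while computing the training loss, so the student needs audio and lip inputs alone at inference, which is the property the abstract relies on.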