Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR). We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination of labeled and unlabeled videos with continuously regenerated pseudo-labels. Our models are trained for speech recognition from audio-visual inputs and can perform speech recognition using both audio and visual modalities, or only one modality. Our method uses the same audio-visual model for both supervised training and pseudo-label generation, removing the need for external speech recognition models to generate pseudo-labels. AV-CPL obtains significant improvements in VSR performance on the LRS3 dataset while maintaining practical ASR and AVSR performance. Finally, our method can also leverage unlabeled visual-only speech data to improve VSR.
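As a rough illustration of what "continuously regenerated pseudo-labels" can mean, the toy Python sketch below trains a single model on labeled examples and, after a warm-up phase, also on unlabeled examples that are labeled by the current state of that same model. Everything in it is an assumption made for illustration: the `CentroidModel` class, the `continuous_pseudo_labeling` function, the warm-up schedule, and the one-dimensional inputs are hypothetical stand-ins, not the AV-CPL architecture, loss, or training recipe.

```python
class CentroidModel:
    """Toy stand-in for the recognizer: one running-mean centroid per class."""

    def __init__(self, num_classes):
        self.sums = [0.0] * num_classes
        self.counts = [0] * num_classes

    def update(self, x, y):
        # "Training step": fold the example into the centroid of class y.
        self.sums[y] += x
        self.counts[y] += 1

    def centroid(self, c):
        return self.sums[c] / self.counts[c]

    def predict(self, x):
        # Nearest centroid among the classes seen so far.
        seen = [c for c in range(len(self.counts)) if self.counts[c] > 0]
        return min(seen, key=lambda c: abs(x - self.centroid(c)))


def continuous_pseudo_labeling(labeled, unlabeled, num_classes,
                               warmup_steps, total_steps):
    """Hypothetical self-training loop with regenerated pseudo-labels.

    labeled:   list of (x, y) pairs
    unlabeled: list of x values
    The warm-up phase (an assumption of this sketch) uses labeled data only.
    """
    if warmup_steps < 1:
        raise ValueError("need at least one supervised step before pseudo-labeling")
    model = CentroidModel(num_classes)
    for step in range(total_steps):
        # Supervised update on a labeled example (cycled deterministically).
        x, y = labeled[step % len(labeled)]
        model.update(x, y)
        if step >= warmup_steps:
            # The pseudo-label comes from the *current* model, so it is
            # regenerated each time an unlabeled example is revisited.
            u = unlabeled[step % len(unlabeled)]
            model.update(u, model.predict(u))
    return model


# Usage on made-up data: two well-separated classes.
labeled = [(0.0, 0), (10.0, 1)]
unlabeled = [1.0, 2.0, 9.0, 8.0]
model = continuous_pseudo_labeling(labeled, unlabeled, num_classes=2,
                                   warmup_steps=2, total_steps=10)
```

The point of the sketch is only the control flow: no external model produces the pseudo-labels, and no pseudo-label is stored permanently, so labels can change as the model improves.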