This paper proposes a Visual Speech Recognition (VSR) method for multiple languages, targeting low-resource languages for which only a limited amount of labeled data is available. Unlike previous methods, which tried to improve VSR performance on a target language by transferring knowledge learned from other languages, we explore whether the amount of training data itself can be increased for each language without human intervention. To this end, we employ a Whisper model, which can perform both language identification and audio-based speech recognition. It is used to filter an unannotated, multilingual audio-visual data pool for the desired languages and to transcribe labels for the selected data. By comparing VSR models trained on automatic labels with models trained on human-annotated labels, we show that automatic labels yield VSR performance similar to that of human-annotated labels, without using any human annotation. Using this automated labeling process, we label the large-scale unlabeled multilingual databases VoxCeleb2 and AVSpeech, producing 1,002 hours of data for four languages that are low-resource for VSR: French, Italian, Spanish, and Portuguese. With the automatic labels, we achieve new state-of-the-art performance on mTEDx in all four languages, significantly surpassing previous methods. The automatic labels are available online: https://github.com/JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages
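The filter-then-transcribe labeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `auto_label`, `AutoLabel`, and `make_whisper_backend`, the `min_confidence` threshold, and the choice of the `base` checkpoint are all assumptions made for illustration. The language-identification and transcription functions are passed in as callables, so the core logic can be run without any model; the optional adapter shows one way the open-source `openai-whisper` package might be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple

# ISO codes for the four target languages named in the abstract.
TARGET_LANGUAGES = ("fr", "it", "es", "pt")


@dataclass
class AutoLabel:
    """One automatically labeled clip (hypothetical record layout)."""
    clip_path: str
    language: str
    transcript: str


def auto_label(
    clips: Iterable[str],
    detect_language: Callable[[str], Tuple[str, float]],
    transcribe: Callable[[str, str], str],
    target_languages: Sequence[str] = TARGET_LANGUAGES,
    min_confidence: float = 0.0,  # assumption: the abstract states no threshold
) -> List[AutoLabel]:
    """Keep clips in a target language and attach an ASR transcript to each."""
    labels = []
    for path in clips:
        language, confidence = detect_language(path)
        # Step 1: language filtering.
        if language not in target_languages or confidence < min_confidence:
            continue
        # Step 2: audio-based transcription becomes the training label.
        text = transcribe(path, language).strip()
        if text:
            labels.append(AutoLabel(path, language, text))
    return labels


def make_whisper_backend(model_name: str = "base"):
    """Optional adapter around the openai-whisper package (imported lazily)."""
    import whisper  # requires `pip install openai-whisper` and ffmpeg

    model = whisper.load_model(model_name)

    def detect_language(path: str) -> Tuple[str, float]:
        # Language ID runs on the first 30 seconds of the audio track.
        audio = whisper.pad_or_trim(whisper.load_audio(path))
        mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels)
        _, probs = model.detect_language(mel.to(model.device))
        language = max(probs, key=probs.get)
        return language, probs[language]

    def transcribe(path: str, language: str) -> str:
        return model.transcribe(path, language=language)["text"]

    return detect_language, transcribe
```

With the adapter, a call might look like `auto_label(paths, *make_whisper_backend())`, where `paths` is a list of video or audio files. Passing the two steps as callables keeps the filtering logic independent of any particular model and makes it easy to check on stub data.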