This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages with a limited amount of labeled data. Unlike previous methods, which improve VSR performance for a target language by transferring knowledge learned from other languages, we explore whether we can increase the amount of training data itself for those languages without human intervention. To this end, we employ a Whisper model, which can perform both language identification and audio-based speech recognition. It filters data of the desired languages from an unannotated, multilingual audio-visual data pool and transcribes labels for them. By comparing the performance of VSR models trained on automatic labels with that of models trained on human-annotated labels, we show that similar VSR performance can be achieved even without human annotations. Through this automated labeling process, we label the large-scale unlabeled multilingual databases VoxCeleb2 and AVSpeech, producing 1,002 hours of data for four languages with limited VSR resources: French, Italian, Spanish, and Portuguese. With the automatic labels, we achieve new state-of-the-art performance on mTEDx in all four languages, significantly surpassing previous methods. The automatic labels are available online: https://github.com/JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages
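The filtering-and-labeling step described above can be sketched as follows. This is a minimal illustration, not the authors' released pipeline: the function name `filter_and_label`, the `clip` identifier field, and the dict layout are our assumptions; the commented Whisper calls reflect the standard openai-whisper API, whose `transcribe()` output includes `"language"` and `"text"` keys.

```python
# Hypothetical sketch of automatic labeling: keep only clips whose
# detected language is one of the four target languages, and use the
# automatic transcript as the VSR training label.

TARGET_LANGS = {"fr", "it", "es", "pt"}  # French, Italian, Spanish, Portuguese

def filter_and_label(transcriptions, target_langs=TARGET_LANGS):
    """Filter transcription results down to target-language clips.

    `transcriptions` is a list of dicts shaped like the output of
    whisper's model.transcribe() ({"language": ..., "text": ...}),
    with a clip identifier added by the caller.
    """
    labeled = []
    for item in transcriptions:
        if item["language"] in target_langs:
            labeled.append({
                "clip": item["clip"],
                "lang": item["language"],
                "label": item["text"].strip(),  # automatic transcript as label
            })
    return labeled

# With openai-whisper installed, the per-clip inputs would come from e.g.:
#   import whisper
#   model = whisper.load_model("large-v2")
#   out = model.transcribe("clip_0001.wav")  # out["language"], out["text"]
```

Clips rejected at this stage (language outside the target set) are simply discarded rather than corrected by hand, which is what makes the labeling process free of human intervention.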