Audio-visual speech recognition has attracted considerable attention because it is robust to acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has improved substantially, mainly due to the use of larger models and training sets. However, accurately labelling datasets is time-consuming and expensive. Hence, in this work, we investigate the use of automatically generated transcriptions of unlabelled datasets to increase the training set size. For this purpose, we use publicly available pre-trained ASR models to automatically transcribe unlabelled datasets such as AVSpeech and VoxCeleb2. We then train ASR, VSR, and AV-ASR models on the augmented training set, which consists of the LRS2 and LRS3 datasets together with the additional automatically transcribed data. We demonstrate that increasing the size of the training set, a recent trend in the literature, reduces the word error rate (WER) even when the transcriptions are noisy. The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3. In particular, it achieves a WER of 0.9% on LRS3, a relative improvement of 30% over the current state-of-the-art approach, and outperforms methods trained on non-publicly available datasets with 26 times more training data.
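To make the quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows how a corpus-level WER and a relative improvement can be computed, and how an augmented training set might be assembled from labelled data plus automatically transcribed clips. The helper names (`word_errors`, `wer`, `relative_reduction`, `build_augmented_set`) and the `transcribe` callable are hypothetical; `transcribe` stands in for whichever pre-trained ASR model is used, and the numbers in the usage note are illustrative rather than results from the paper.

```python
from typing import Callable, Iterable, List, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    if words == 0:
        raise ValueError("reference text is empty")
    return errors / words


def relative_reduction(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline` (both are error rates)."""
    return (baseline - new) / baseline


def build_augmented_set(
    labelled: Iterable[Tuple[str, str]],
    unlabelled_clips: Iterable[str],
    transcribe: Callable[[str], str],  # hypothetical wrapper around a pre-trained ASR model
) -> List[Tuple[str, str]]:
    """Combine human-labelled (clip, text) pairs with automatically transcribed clips."""
    pseudo_labelled = [(clip, transcribe(clip)) for clip in unlabelled_clips]
    return list(labelled) + pseudo_labelled
```

As an illustrative usage, `wer([("the cat sat on the mat", "the cat sit on mat")])` counts one substitution and one deletion against six reference words, and `relative_reduction(10.0, 7.0)` corresponds to a 30% relative improvement. The pseudo-labels returned by `transcribe` may contain errors; the abstract's claim is that the additional data still helps despite such noise.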