This paper proposes a novel, resource-efficient approach to Visual Speech Recognition (VSR) that leverages speech representations produced by any trained Automatic Speech Recognition (ASR) model. Moving away from the resource-intensive trend prevalent in recent literature, our method distills knowledge from a trained Conformer-based ASR model and achieves competitive performance on standard VSR benchmarks with significantly lower resource use. Trained only on unlabeled audio-visual data, our baseline model achieves word error rates (WER) of 47.4% and 54.7% on the LRS2 and LRS3 test benchmarks, respectively. After fine-tuning with limited labeled data, the WER drops to 35% (LRS2) and 45.7% (LRS3). Our model can be trained on a single consumer-grade GPU within a few days and can perform real-time end-to-end VSR on dated hardware, suggesting a path towards more accessible and resource-efficient VSR methodologies.
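For readers unfamiliar with the metric reported above, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in this work; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenization is a plain
    whitespace split, which is an assumption, not the paper's protocol.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a WER of 35% corresponds to roughly 35 word-level errors per 100 reference words. Benchmark scores are usually aggregated over a whole test set by summing errors and reference words across utterances rather than averaging per-utterance rates; this sketch handles a single utterance only.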