SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen

From arXiv; IEEE/CVF CVPR 2023

Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while publicly available transcribed video datasets are limited in size. In this paper, we study, for the first time, the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems with synthetic lip movements. The key idea behind SynthVSR is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech. The lip animation model is trained on an unlabeled audio-visual dataset and can be further optimized towards a pre-trained VSR model when labeled videos are available. Because plenty of transcribed acoustic data and face images are available, we can generate large-scale synthetic data with the proposed lip animation model for semi-supervised VSR training. We evaluate our approach on the largest public VSR benchmark, Lip Reading Sentences 3 (LRS3). SynthVSR achieves a word error rate (WER) of 43.3% with only 30 hours of real labeled data, outperforming off-the-shelf approaches that use thousands of hours of video. The WER is further reduced to 27.9% when using all 438 hours of labeled data from LRS3, which is on par with the state-of-the-art self-supervised AV-HuBERT method. Furthermore, when combined with large-scale pseudo-labeled audio-visual data, SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours). Finally, we perform extensive ablation studies to understand the effect of each component of the proposed method.
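All results above are reported as word error rate (WER). For readers unfamiliar with the metric, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the authors' evaluation code, and benchmark scoring typically also applies text normalization and aggregates errors over the whole test set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

In this toy example, one substitution against six reference words gives a WER of 1/6, or roughly 16.7%. Note that WER can exceed 100% when the hypothesis contains many inserted words.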
