Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is considered challenging because lip movements alone carry insufficient speech information. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) that complements the insufficient speech information of the visual modality with the audio modality. Unlike previous methods, the proposed AKVSR 1) utilizes rich audio knowledge encoded by a large-scale pretrained audio model, 2) stores the linguistic information of that audio knowledge in a compact audio memory by discarding non-linguistic information from the audio through quantization, and 3) includes an Audio Bridging Module that finds the best-matched audio features in the compact audio memory, which makes training possible without audio inputs once the compact audio memory has been composed. We validate the effectiveness of the proposed method through extensive experiments and achieve new state-of-the-art performance on the widely used LRS2 and LRS3 datasets.
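The two mechanisms named above, a quantized compact audio memory and a module that retrieves the best-matched audio features for a visual input, can be illustrated with a minimal sketch. It is not the AKVSR implementation: the abstract does not specify the quantizer or the form of the Audio Bridging Module, so plain k-means and softmax attention over cosine similarity are assumed here. The names `build_audio_memory` and `bridge`, the `temperature` parameter, and the random vectors standing in for pretrained audio and visual features are all hypothetical.

```python
import math
import random


def _sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _cosine(a, b):
    """Cosine similarity; a zero vector is treated as having norm 1."""
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def build_audio_memory(audio_feats, num_slots, iters=10, seed=0):
    """Quantize audio features into `num_slots` centroids (assumed: k-means)."""
    rng = random.Random(seed)
    # Initialise the memory slots from randomly chosen audio frames.
    memory = [list(f) for f in rng.sample(audio_feats, num_slots)]
    for _ in range(iters):
        clusters = [[] for _ in range(num_slots)]
        for f in audio_feats:
            nearest = min(range(num_slots), key=lambda k: _sqdist(f, memory[k]))
            clusters[nearest].append(f)
        for k, members in enumerate(clusters):
            if members:  # a slot that received no frames keeps its old value
                memory[k] = [sum(col) / len(members) for col in zip(*members)]
    return memory


def bridge(visual_feat, memory, temperature=0.1):
    """Retrieve an audio feature for one visual frame (assumed: soft attention).

    Returns the attention-weighted mix of memory slots and the weights.
    """
    logits = [_cosine(visual_feat, slot) / temperature for slot in memory]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    retrieved = [
        sum(w * slot[d] for w, slot in zip(weights, memory)) for d in range(dim)
    ]
    return retrieved, weights


# Random stand-ins for features from a pretrained audio model and a visual encoder.
rng = random.Random(0)
audio_feats = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(50)]
memory = build_audio_memory(audio_feats, num_slots=8)
visual_feat = [rng.gauss(0, 1) for _ in range(4)]
retrieved, weights = bridge(visual_feat, memory)
```

In this sketch `bridge` takes only a visual feature and the stored memory, which mirrors the abstract's point that audio inputs are no longer needed once the memory has been composed. A lower `temperature` pushes the weights toward a hard selection of the single best-matched slot.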