Visual Speech Recognition (VSR) is the task of predicting a sentence or word from lip movements. Several recent works use audio signals to supplement visual information. However, existing methods exploit only limited information, such as phoneme-level features and the soft labels of Automatic Speech Recognition (ASR) networks. In this paper, we present a Multi-Temporal Lip-Audio Memory (MTLAM) that makes the best use of audio signals to complement the insufficient information in lip movements. The proposed method consists of two main parts: 1) The MTLAM stores multi-temporal audio features produced from short- and long-term audio signals, and memorizes a visual-to-audio mapping so that the stored multi-temporal audio features can be loaded from visual features at the inference phase. 2) We design an audio temporal model that produces multi-temporal audio features capturing the context of neighboring words. In addition, to construct an effective visual-to-audio mapping, the audio temporal models can generate audio features that are time-aligned with the visual features. Through extensive experiments, we validate the effectiveness of the MTLAM, which achieves state-of-the-art performance on two public VSR datasets.
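The abstract does not spell out how the memory is addressed, so the following is only a minimal sketch of one common way a visual-to-audio memory could work: a generic key-value memory in which a visual feature is compared against learned key slots, and the resulting attention weights read out stored audio features. The function `recall_audio`, the bank names `short_term` and `long_term`, the cosine-similarity addressing, and the `scale` parameter are all illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores, scale=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(scale * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)


def recall_audio(visual_query, keys, value_banks, scale=10.0):
    """Address the memory with a visual feature (hypothetical helper).

    keys:        one key vector per memory slot, in the visual feature space
    value_banks: dict mapping a bank name to one stored audio vector per slot
    Returns the addressing weights and one recalled audio vector per bank.
    """
    # Assumption: slots are addressed by scaled cosine similarity.
    weights = softmax([cosine(visual_query, k) for k in keys], scale)
    recalled = {}
    for name, values in value_banks.items():
        dim = len(values[0])
        # Weighted sum of the stored audio vectors in this bank.
        recalled[name] = [
            sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)
        ]
    return weights, recalled


# Toy memory with two slots and two audio banks of different dimensionality.
keys = [[1.0, 0.0], [0.0, 1.0]]
value_banks = {
    "short_term": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "long_term": [[0.5, 0.5], [-0.5, 0.5]],
}
weights, recalled = recall_audio([0.9, 0.1], keys, value_banks)
```

In this toy setup the query is closest to the first key, so the recalled vectors are dominated by the first slot of each bank. The same addressing weights read from both banks, which is one simple way to retrieve short- and long-term audio features from a single visual feature when no audio is available at inference.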