Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. The task is especially challenging due to the absence of auditory information and the inherent ambiguity of visemes, where different phonemes appear identical on the lips. Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently suffers from high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for V-ASR that addresses these longstanding challenges. First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head, reducing task complexity and achieving robust speaker invariance. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences by leveraging broader linguistic context. Unlike existing methods that either predict words directly (often faltering on visually similar phonemes) or rely on large-scale multimodal pre-training, our approach explicitly encodes intermediate linguistic structure while remaining highly data efficient. We demonstrate state-of-the-art performance on two challenging datasets, LRS2 and LRS3, where our method achieves significant reductions in Word Error Rate (WER), reaching a state-of-the-art WER of 18.7 on LRS3 despite using 99.4% less labelled data than the next best approach.
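To make the two-stage design concrete, the following is a minimal sketch of the pipeline described above: a Video Transformer encoder with a CTC head over phoneme classes (Stage 1), followed by a phoneme-to-text interface to a fine-tuned LLM (Stage 2). All shapes, hyperparameters, class names, and the prompt format are illustrative assumptions, not the paper's exact implementation; the visual front-end that embeds lip-region frames is omitted.

```python
# Hypothetical sketch of the two-stage V-ASR pipeline; not the authors' exact code.
import torch
import torch.nn as nn

class VisualPhonemePredictor(nn.Module):
    """Stage 1: Video Transformer encoder with a CTC head over phoneme classes."""
    def __init__(self, feat_dim=512, num_phonemes=40, num_layers=6, num_heads=8):
        super().__init__()
        # Assumes lip-region frames were already embedded into feat_dim vectors
        # by a visual front-end (e.g. a 3D-CNN), which is omitted here.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # CTC head: one extra output class for the CTC blank symbol (index 0).
        self.ctc_head = nn.Linear(feat_dim, num_phonemes + 1)

    def forward(self, frame_feats):            # (batch, time, feat_dim)
        enc = self.encoder(frame_feats)        # (batch, time, feat_dim)
        logits = self.ctc_head(enc)            # (batch, time, num_phonemes + 1)
        return logits.log_softmax(dim=-1)

def ctc_training_step(model, frame_feats, phoneme_targets, input_lens, target_lens):
    """One Stage-1 training step using the alignment-free CTC loss."""
    log_probs = model(frame_feats).transpose(0, 1)  # CTC expects (time, batch, classes)
    return nn.functional.ctc_loss(
        log_probs, phoneme_targets, input_lens, target_lens, blank=0)

def phonemes_to_text(llm, tokenizer, phoneme_seq):
    """Stage 2 (interface only): a fine-tuned LLM maps decoded phonemes to words,
    e.g. ["DH", "AH", "K", "AE", "T"] -> "the cat". Prompt format is assumed."""
    prompt = "Phonemes: " + " ".join(phoneme_seq) + "\nWords:"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = llm.generate(ids, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

In this sketch, CTC lets Stage 1 emit a compact phoneme sequence without frame-level alignments, and Stage 2 treats word reconstruction as a purely linguistic problem, which is what makes the intermediate phoneme representation both speaker-invariant and data-efficient.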