The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI word-level lipreading dataset using word, character, phoneme, and viseme-level metrics. Although models achieve higher overall accuracy, they succeed and fail on different words than humans. A text-only n-gram baseline given only a few initial phonemes rivals human lipreading. VSR word-level errors are consistently better explained by training word frequency than by the visual informativeness of words. Viseme accuracies, confusion matrices and human-model correlations further show that models gain most on visemes humans find hardest, and show much weaker dependence on visual clarity. Our work demonstrates that VSR systems rely primarily on language cues from training data rather than visual perception, failing to bind visual features into meaningful words.
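To make the idea of a text-only baseline concrete, here is a minimal sketch of guessing a word from its first few phonemes using lexical frequency alone, with no video input. It is a simplified frequency-lookup stand-in, not the n-gram model the abstract refers to. The toy lexicon, its transcriptions and counts, and the function names `guess_from_prefix` and `prefix_accuracy` are all hypothetical.

```python
# Hypothetical toy lexicon: word -> (phoneme sequence, corpus frequency).
# The ARPAbet-style transcriptions and the counts are invented for illustration;
# they are not taken from MaFI or from any VSR training corpus.
LEXICON = {
    "bat": (("B", "AE", "T"), 50),
    "bad": (("B", "AE", "D"), 120),
    "ban": (("B", "AE", "N"), 30),
    "mat": (("M", "AE", "T"), 40),
    "man": (("M", "AE", "N"), 300),
}


def guess_from_prefix(prefix, lexicon=LEXICON):
    """Return the most frequent word whose phonemes start with `prefix`, or None."""
    prefix = tuple(prefix)
    candidates = [
        (freq, word)
        for word, (phones, freq) in lexicon.items()
        if phones[:len(prefix)] == prefix
    ]
    if not candidates:
        return None
    # Highest frequency wins; ties fall back to alphabetical order of the word.
    return max(candidates)[1]


def prefix_accuracy(k, lexicon=LEXICON):
    """Fraction of lexicon words recovered from their first k phonemes alone."""
    hits = sum(
        guess_from_prefix(phones[:k], lexicon) == word
        for word, (phones, _) in lexicon.items()
    )
    return hits / len(lexicon)


if __name__ == "__main__":
    print(guess_from_prefix(["B", "AE"]))
    print(prefix_accuracy(2))
```

In this toy setup, a guesser that never sees the speaker still recovers the frequent words from a short prefix, which is the kind of language-prior effect the abstract contrasts with visual informativeness.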
