Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI word-level lipreading dataset using word, character, phoneme, and viseme-level metrics. Although models achieve higher overall accuracy, they succeed and fail on different words than humans. A text-only n-gram baseline given only a few initial phonemes rivals human lipreading. VSR word-level errors are consistently better explained by training word frequency than by the visual informativeness of words. Viseme accuracies, confusion matrices and human-model correlations further show that models gain most on visemes humans find hardest, and show much weaker dependence on visual clarity. Our work demonstrates that VSR systems rely primarily on language cues from training data rather than visual perception, failing to bind visual features into meaningful words.
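To make the text-only baseline concrete, here is a minimal sketch of how a model that never sees video could guess a word from its first few phonemes. It is an illustration under assumptions, not the authors' implementation: the abstract does not specify the n-gram model, so a frequency-weighted prefix lookup stands in for it, and the toy lexicon, its counts, and the function names (`build_prefix_index`, `predict`, `word_accuracy`) are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon: word -> (phoneme sequence, training-corpus count).
# The phoneme symbols and counts are made up for illustration only.
TOY_LEXICON = {
    "bat": (("B", "AE", "T"), 50),
    "bad": (("B", "AE", "D"), 120),
    "ban": (("B", "AE", "N"), 30),
    "mat": (("M", "AE", "T"), 40),
    "man": (("M", "AE", "N"), 200),
}


def build_prefix_index(lexicon, k):
    """Map each k-phoneme prefix to a frequency count over candidate words."""
    index = defaultdict(Counter)
    for word, (phonemes, count) in lexicon.items():
        index[tuple(phonemes[:k])][word] += count
    return index


def predict(index, phonemes, k):
    """Guess the most frequent word sharing the first k phonemes, or None."""
    candidates = index.get(tuple(phonemes[:k]))
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


def word_accuracy(index, lexicon, k):
    """Fraction of lexicon words recovered from their first k phonemes alone."""
    hits = sum(predict(index, ph, k) == word for word, (ph, _) in lexicon.items())
    return hits / len(lexicon)


if __name__ == "__main__":
    for k in (1, 2, 3):
        idx = build_prefix_index(TOY_LEXICON, k)
        print(k, word_accuracy(idx, TOY_LEXICON, k))
```

In this toy setting the guesser always falls back on the most frequent word for a given prefix, so its errors track word frequency rather than anything visual. That is the kind of behavior the abstract attributes to the VSR systems.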