Current Audio-Visual Speech Recognition (AVSR) models achieve near-perfect performance on the standard LRS3 benchmark, raising concerns about adaptive overfitting. To systematically assess true generalisability, we construct a highly controlled, unseen evaluation set subsampled from the large-scale MultiVSR dataset. Unlike standard out-of-distribution benchmarks, our subset strictly matches the acoustic, visual, and demographic distributions of the LRS3 test set. Evaluating five state-of-the-art architectures reveals a performance collapse across all of them, showing that current systems fail to generalise even under strictly aligned conditions. Through a fine-grained attribute analysis across seven factors, we isolate the specific drivers of this degradation. Furthermore, we uncover a pronounced lexical bias, expose distinct error patterns, and find, surprisingly, that audio-visual performance lags behind audio-only performance. We release our matched test set for future benchmarking.