Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework that uses Shapley values to analyze modality contributions in AVSR. We introduce three analyses and apply them to six models across two benchmarks and varying signal-to-noise ratio (SNR) levels: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.
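To make the idea of a Shapley-based modality contribution concrete, the following is a minimal sketch of the exact Shapley computation for two "players", audio and video. It is an illustration only, not the Dr. SHAP-AV implementation: the function name `shapley_values`, the `toy_scores` table, and the choice of a scalar score per modality subset (for example, a model's score for the reference transcript when the other modality is masked) are all assumptions introduced here, and the numbers are made up.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a small set of players.

    `value` maps a frozenset of players to a scalar score.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        # Enumerate every coalition that does not contain p.
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal gain from adding p to coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical scores for each modality subset (illustrative values only).
toy_scores = {
    frozenset(): 0.0,                    # both modalities masked
    frozenset({"audio"}): 0.6,           # audio only
    frozenset({"video"}): 0.2,           # video only
    frozenset({"audio", "video"}): 1.0,  # both modalities present
}

phi = shapley_values(["audio", "video"], toy_scores.__getitem__)

# Normalised share of the total attribution assigned to audio.
audio_share = abs(phi["audio"]) / (abs(phi["audio"]) + abs(phi["video"]))
```

With only two modalities, each Shapley value is the average of two marginal gains: adding the modality to an empty input, and adding it when the other modality is already present. The two values sum to the gap between the full-input score and the empty-input score, which is what allows them to be read as a split of the model's output between audio and video.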