This study evaluates AV-HuBERT's perceptual bio-fidelity by benchmarking its response to incongruent audiovisual stimuli (McGurk effect) against human observers (N=44). Results reveal a striking quantitative isomorphism: AI and humans exhibited nearly identical auditory dominance rates (32.0% vs. 31.8%), suggesting the model captures biological thresholds for auditory resistance. However, AV-HuBERT showed a deterministic bias toward phonetic fusion (68.0%), significantly exceeding human rates (47.7%). While humans displayed perceptual stochasticity and diverse error profiles, the model remained strictly categorical. Findings suggest that current self-supervised architectures mimic multisensory outcomes but lack the neural variability inherent to human speech perception.
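The reported gap in fusion rates (68.0% for AV-HuBERT vs. 47.7% for humans) can be checked for statistical significance with a standard two-proportion z-test. The sketch below is illustrative only: the trial counts (440 per group) are hypothetical placeholders chosen to match the reported percentages, not figures from the study.

```python
import math

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int):
    """Two-sided two-proportion z-test using the pooled proportion.

    x1, x2: successes (e.g., fusion responses); n1, n2: trial counts.
    Returns (z statistic, two-sided p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal survival function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts matching the abstract's rates (NOT from the study):
# model: 299/440 ≈ 68.0% fusion; humans: 210/440 ≈ 47.7% fusion.
z, p = two_proportion_ztest(299, 440, 210, 440)
print(f"z = {z:.2f}, p = {p:.2e}")
```

With these assumed sample sizes the difference is highly significant; the real test would of course depend on the study's actual trial counts and on how correlated responses within participants are handled.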