Despite their success, audio-visual large language models (LLMs) can produce plausible but ungrounded outputs, a failure known as hallucination. Existing benchmarks focus on environmental sounds (e.g., a dog barking) that indicate event occurrence. In contrast, human speech carries fundamentally different, rich semantics and temporal structure, yet it remains unexplored whether current models can accurately align speech content with the corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs. To study this systematically, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech-vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech-vision hallucinations from two critical and complementary aspects: semantic and temporal. Experimental results demonstrate that state-of-the-art open-source audio-visual LLMs struggle to align speech content with the corresponding visual signals, reaching near-random accuracy on multiple tasks. In contrast, Gemini 2.5 Pro significantly outperforms the open-source models. Our analysis suggests that the observed failures stem from a limited ability in cross-modality understanding, despite strong performance in single-modality perception. Our work uncovers a new and fundamental limitation of current audio-visual LLMs and highlights the need for speech-grounded video comprehension. Project page: https://chenshuang-zhang.github.io/projects/svhalluc/.