Auditing the Reliability of Multimodal Generative Search

Multimodal Large Language Models (MLLMs) increasingly function as generative search systems that retrieve and synthesize answers from multimedia content, including YouTube videos. Although these systems project authority by citing specific videos as evidence, the extent to which these citations genuinely substantiate the generated claims remains unexamined. We present a large-scale audit of the Gemini 2.5 Pro multimodal search system, analyzing 11,943 claim-video pairs generated across Medical, Economic, and General domains. Through automated verification using three independent LLM judges (87.7% inter-rater agreement), validated against human annotations, we find that, depending on the judge's strictness, between 3.7% and 18.7% of video-grounded claims are not supported by their cited sources. The dominant failure modes are not outright contradictions but unverifiable specificities and overstated claims, suggesting that the system injects precise but ungrounded details from parametric knowledge while citing videos as evidence. Exploratory post-hoc analysis via logistic regression reveals properties associated with these failures: claims departing from source vocabulary ($\beta = -1.6$ to $-3.1$, $p < 0.01$) and claims with low semantic similarity to the video transcript ($\beta = -2.1$ to $-11.6$, $p < 0.01$) are significantly more likely to be unsupported. These findings characterize the current trustworthiness of video-based generative search and highlight the gap between the confidence these systems project and the fidelity of their outputs.
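
As a rough illustration of the quantities named above, the following sketch computes a per-judge unsupported rate, an agreement score across three judges, and a simple vocabulary-overlap feature for a claim and a transcript. Everything in it is an assumption made for illustration: the function names are hypothetical, the verdicts are toy data, agreement is taken to be mean pairwise percent agreement, and overlap is a whitespace-token ratio. The abstract does not specify how the audit defines any of these measures.

```python
from itertools import combinations


def unsupported_rate(verdicts):
    """Fraction of claims a single judge marks as unsupported (False)."""
    return sum(1 for v in verdicts if not v) / len(verdicts)


def mean_pairwise_agreement(judges):
    """Average, over all judge pairs, of the share of identical verdicts.

    Assumption: this is one common way to report agreement; the audit's
    own 87.7% figure may be defined differently.
    """
    pairs = list(combinations(judges, 2))
    total = 0.0
    for a, b in pairs:
        total += sum(1 for x, y in zip(a, b) if x == y) / len(a)
    return total / len(pairs)


def lexical_overlap(claim, transcript):
    """Share of distinct claim tokens that also occur in the transcript.

    Assumption: lowercase whitespace tokenization stands in for whatever
    vocabulary measure the audit actually uses.
    """
    claim_tokens = set(claim.lower().split())
    transcript_tokens = set(transcript.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & transcript_tokens) / len(claim_tokens)


# Toy verdicts for ten claims (True = supported); not data from the audit.
lenient = [True] * 9 + [False]
moderate = [True] * 8 + [False] * 2
strict = [True] * 7 + [False] * 3
judges = [lenient, moderate, strict]

rates = [unsupported_rate(j) for j in judges]
agreement = mean_pairwise_agreement(judges)
overlap = lexical_overlap(
    "the drug lowers blood pressure",
    "this drug lowers blood pressure in adults",
)

print(f"unsupported rate range: {min(rates):.1%} to {max(rates):.1%}")
print(f"mean pairwise agreement: {agreement:.1%}")
print(f"lexical overlap: {overlap:.2f}")
```

Reporting the minimum and maximum rate across judges mirrors the way the abstract states a range rather than a single figure. A feature such as `lexical_overlap` could then serve as one predictor in a logistic regression on the supported/unsupported label.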
