Auditing the Reliability of Multimodal Generative Search

Multimodal Large Language Models (MLLMs) increasingly function as generative search systems that retrieve and synthesize answers from multimedia content, including YouTube videos. Although these systems project authority by citing specific videos as evidence, the extent to which these citations actually substantiate the generated claims remains underexplored. We present a large-scale audit of the Gemini 2.5 Pro multimodal search system, analyzing 11,943 claim-video pairs generated across Medical, Economic, and General domains. Using automated verification by three independent LLM judges (87.7\% inter-rater agreement), validated against human annotations, we find that, depending on the judge's strictness, between 3.7\% and 18.7\% of video-grounded claims are not supported by their cited sources. The dominant failure modes are not outright contradictions but unverifiable specifics and overstated claims, suggesting that the system injects precise but ungrounded details from parametric knowledge while citing videos as evidence. An exploratory post-hoc logistic regression identifies properties associated with these failures: claims that depart from the source vocabulary ($\beta = -1.6$ to $-3.1$, $p < 0.01$) and claims with low semantic similarity to the video transcript ($\beta = -2.1$ to $-11.6$, $p < 0.01$) are significantly more likely to be unsupported. These findings characterize the current trustworthiness of video-based generative search and highlight the gap between the confidence these systems project and the fidelity of their outputs. The dataset is available at https://anonymous.4open.science/r/icwsm-gemini-audit-04DF.
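
To make the reported range concrete, the following minimal Python sketch shows one way a per-judge unsupported rate and an agreement figure could be computed from binary judge labels. It is an illustration only: the judge names, the toy labels, and the choice of mean pairwise agreement as the agreement measure are assumptions made here, not the definitions or data used in the audit.

```python
from itertools import combinations


def unsupported_rate(labels):
    """Fraction of claims that one judge marks as unsupported (False)."""
    return sum(1 for supported in labels if not supported) / len(labels)


def mean_pairwise_agreement(judgments):
    """Average, over all judge pairs, of the share of claims labeled identically.

    Assumption: this is only one possible agreement measure; the audit's
    own definition of inter-rater agreement may differ.
    """
    pairs = list(combinations(judgments.values(), 2))
    per_pair = [
        sum(a == b for a, b in zip(first, second)) / len(first)
        for first, second in pairs
    ]
    return sum(per_pair) / len(per_pair)


# Hypothetical toy labels for ten claim-video pairs (True = supported).
# Judge names and values are invented for illustration.
judgments = {
    "lenient": [True, True, True, True, True, True, True, True, True, False],
    "moderate": [True, True, True, True, True, True, True, True, False, False],
    "strict": [True, True, True, True, True, True, True, False, False, False],
}

rates = {name: unsupported_rate(labels) for name, labels in judgments.items()}
low, high = min(rates.values()), max(rates.values())
agreement = mean_pairwise_agreement(judgments)

print(f"unsupported rate range: {low:.1%} to {high:.1%}")
print(f"mean pairwise agreement: {agreement:.1%}")
```

Reporting the minimum and maximum rate across judges, rather than a single pooled number, mirrors the abstract's framing that the estimate depends on how strict the judge is.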
