Users of search-augmented LLMs rely on citations as evidence that responses are grounded in real sources, and rarely verify the cited pages themselves. Millions of queries per day now pass through these systems, making citation quality a silent determinant of whether users are informed or misled. Yet existing benchmarks each address one facet in isolation, leaving the joint structure that determines citation trustworthiness unmeasured. We construct CITETRACE, a large-scale dataset that traces the full citation chain from user query through retrieved source to generated answer: 11,200 real-world queries from 28 communities paired with 112,000 responses from ten models across five providers, yielding 761,495 evaluable citation pairs. We design a three-dimension evaluation framework that scores each citation on intent-purpose alignment, source suitability, and answer-source fidelity, using expert-validated predefined matrices and a five-level fidelity rubric; the framework applies to any system that produces citation-bearing responses. Applying this framework at scale, we identify a systematic pattern we call VERIFIED MISGUIDANCE (VM): models cite real, accessible sources yet fail along one or more dimensions, producing a fidelity-suitability trade-off in which faithful models select inappropriate sources and vice versa. Across our pool, 30.6% of citations distort their sources and 27.1% originate from domain-inappropriate sources; at the response level, up to 96% of users encounter at least one structurally misleading citation. Provider-level differences explain 88-96% of citation-quality variance, suggesting that source selection is governed more by factors beyond individual model capability than by the LLMs themselves. Together, CITETRACE and its evaluation framework provide the first resource for diagnosing structural citation failures in deployed search-augmented systems.