Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

Users of search-augmented LLMs rely on citations as evidence that responses are grounded in real sources, and rarely verify the cited pages themselves. Millions of queries per day now pass through these systems, making citation quality a silent determinant of whether users are informed or misled. Yet existing benchmarks each address one facet in isolation, leaving the joint structure that determines citation trustworthiness unmeasured. We construct CITETRACE, a large-scale dataset that traces the full citation chain from user query through retrieved source to generated answer: 11,200 real-world queries from 28 communities paired with 112,000 responses from ten models across five providers, yielding 761,495 evaluable citation pairs. We design a three-dimension evaluation framework that scores each citation on intent-purpose alignment, source suitability, and answer-source fidelity, using expert-validated predefined matrices and a five-level fidelity rubric; the framework applies to any system that produces citation-bearing responses. Applying this framework at scale, we identify a systematic pattern we call VERIFIED MISGUIDANCE (VM): models cite real, accessible sources yet fail along one or more dimensions, producing a fidelity-suitability trade-off in which faithful models select inappropriate sources and vice versa. Across our pool, 30.6% of citations distort their sources and 27.1% originate from domain-inappropriate sources; at the response level, up to 96% of users encounter at least one structurally misleading citation. Provider-level differences explain 88-96% of citation-quality variance, suggesting that source selection is governed more by provider-level factors than by the capability of the individual LLMs. Together, CITETRACE and its evaluation framework provide the first resource for diagnosing structural citation failures in deployed search-augmented systems.
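The gap between the citation-level and response-level figures follows from aggregation: a response counts as affected if any one of its citations fails on any dimension. The sketch below illustrates that roll-up on toy data, along with one common way to express the share of variance attributable to a grouping factor (eta-squared). It is a minimal illustration under stated assumptions, not the paper's pipeline: the `Citation` record, the boolean encoding of the first two dimensions, the 1-to-5 direction of the fidelity scale, the `FIDELITY_PASS` threshold, and the choice of eta-squared are all hypothetical, since the abstract does not specify them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical record for one scored citation (field names are assumptions)."""
    response_id: str
    provider: str
    intent_aligned: bool   # intent-purpose alignment judged acceptable
    source_suitable: bool  # source suitability judged acceptable
    fidelity: int          # five-level rubric; assumed 1 = distorts, 5 = faithful


FIDELITY_PASS = 4  # assumed cut-off; the abstract does not state one


def is_misleading(c: Citation) -> bool:
    """A citation is flagged if it fails on at least one of the three dimensions."""
    return not (c.intent_aligned and c.source_suitable and c.fidelity >= FIDELITY_PASS)


def citation_level_rate(citations: list[Citation]) -> float:
    """Fraction of individual citations that are flagged."""
    return sum(is_misleading(c) for c in citations) / len(citations)


def response_level_rate(citations: list[Citation]) -> float:
    """Fraction of responses containing at least one flagged citation."""
    flagged: dict[str, bool] = defaultdict(bool)
    for c in citations:
        flagged[c.response_id] |= is_misleading(c)
    return sum(flagged.values()) / len(flagged)


def provider_variance_share(citations: list[Citation]) -> float:
    """Eta-squared: between-provider sum of squares over total sum of squares."""
    scores = [c.fidelity for c in citations]
    grand_mean = sum(scores) / len(scores)
    total_ss = sum((s - grand_mean) ** 2 for s in scores)
    by_provider: dict[str, list[int]] = defaultdict(list)
    for c in citations:
        by_provider[c.provider].append(c.fidelity)
    between_ss = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in by_provider.values()
    )
    return between_ss / total_ss if total_ss else 0.0


# Toy data: four responses from two made-up providers.
toy = [
    Citation("r1", "A", True, True, 5),
    Citation("r1", "A", True, False, 5),  # unsuitable source
    Citation("r2", "A", True, True, 4),
    Citation("r3", "B", True, True, 2),   # low fidelity
    Citation("r3", "B", True, True, 5),
    Citation("r4", "B", True, True, 5),
]

print(f"citation-level rate: {citation_level_rate(toy):.3f}")
print(f"response-level rate: {response_level_rate(toy):.3f}")
print(f"provider variance share: {provider_variance_share(toy):.3f}")
```

On this toy input, two of six citations are flagged but two of four responses are affected, which shows how a moderate per-citation failure rate can translate into a much higher per-response exposure when responses carry several citations.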

