Retrieval-Augmented Generation (RAG) models are critically undermined by citation hallucinations, a deceptive failure where a model confidently cites a source that fails to support its claim. Existing work often attributes hallucination to a simple over-reliance on the model's parametric knowledge. We challenge this view and introduce FACTUM (Framework for Attesting Citation Trustworthiness via Underlying Mechanisms), a framework of four mechanistic scores measuring the distinct contributions of a model's attention and FFN pathways, and the alignment between them. Our analysis reveals two consistent signatures of correct citation: a significantly stronger contribution from the model's parametric knowledge and greater use of the attention sink for information synthesis. Crucially, we find the signature of a correct citation is not static but evolves with model scale. For example, the signature of a correct citation for the Llama-3.2-3B model is marked by higher pathway alignment, whereas for the Llama-3.1-8B model, it is characterized by lower alignment, where pathways contribute more distinct, orthogonal information. By capturing this complex, evolving signature, FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC. Our findings reframe citation hallucination as a complex, scale-dependent interplay between internal mechanisms, paving the way for more nuanced and reliable RAG systems.