Prior work has shown that fine-tuning on new knowledge can induce factual hallucinations in large language models (LLMs), causing them to produce incorrect outputs when evaluated on previously known information. However, the specific manifestations of such hallucinations and their underlying mechanisms remain insufficiently understood. Our work addresses this gap by designing a controlled dataset, \textit{Biography-Reasoning}, and conducting a fine-grained analysis across multiple knowledge types and two task types: knowledge question answering (QA) and knowledge reasoning. We find that hallucinations not only severely affect tasks involving newly introduced knowledge but also propagate to other evaluation tasks. Moreover, when fine-tuning on a dataset in which a specific knowledge type consists entirely of new knowledge, LLMs exhibit elevated hallucination tendencies. This suggests that the degree of unfamiliarity within a particular knowledge type, rather than the overall proportion of new knowledge, is a stronger driver of hallucinations. Through interpretability analysis, we show that learning new knowledge weakens the model's attention to key entities in the input question, leading to an over-reliance on surrounding context and a higher risk of hallucination. Conversely, reintroducing a small amount of known knowledge during the later stages of training restores attention to key entities and substantially mitigates hallucination behavior. Finally, we demonstrate that disrupted attention patterns can propagate across lexically similar contexts, facilitating the spread of hallucinations beyond the original task.