Large language models (LLMs) acquire most of their factual knowledge during pre-training, through next-token prediction. Subsequent post-training stages often introduce new facts outside the model's parametric knowledge, giving rise to hallucinations. While it has been demonstrated that supervised fine-tuning (SFT) on new knowledge may exacerbate the problem, the underlying mechanisms are still poorly understood. We conduct a controlled fine-tuning experiment, focusing on closed-book QA, and find latent directions that causally contribute to hallucinations. Specifically, we fine-tune Llama 3.1 8B, Gemma 2 9B and Mistral 7B v0.3 on seven distinct single QA datasets, controlling for the percentage of new knowledge and the number of training epochs. By measuring performance on the test set, we validate that incrementally introducing new knowledge increases hallucinations, and that the effect becomes more pronounced with prolonged training. We leverage pre-trained sparse autoencoders (SAEs) to analyze residual stream activations across various checkpoints for each model and propose Monotonic Relationship Feature Identification (MoRFI) for capturing causally relevant latents. MoRFI keeps the SAE features whose response changes monotonically across fine-tuning data mixtures that vary a target property in a controlled way. Our findings show that exposure to unknown facts disrupts the model's ability to retrieve stored knowledge along a set of directions in the residual stream. Our pipeline reliably discovers these directions across distinct models, and single-latent interventions on them recover the disrupted knowledge.
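The monotonic filtering idea behind MoRFI can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the exact monotonicity criterion, so the sketch assumes strict monotonicity of a per-feature mean activation across mixtures ordered by increasing share of new knowledge. The function name `monotonic_features`, the `min_range` threshold, and the toy activation values are all hypothetical.

```python
import numpy as np


def monotonic_features(mean_acts, min_range=0.0):
    """Select features whose mean activation is strictly monotonic across mixtures.

    mean_acts: array of shape (n_mixtures, n_features). Rows are assumed to be
        ordered by increasing share of new (unknown) facts in the fine-tuning data.
    min_range: hypothetical threshold on the absolute change between the first
        and last mixture, used to drop features that barely move.

    Returns (indices, signs): the selected feature indices and +1 / -1 for
    increasing / decreasing responses.
    """
    mean_acts = np.asarray(mean_acts, dtype=float)
    steps = np.diff(mean_acts, axis=0)        # change between consecutive mixtures
    increasing = np.all(steps > 0, axis=0)    # rises at every step
    decreasing = np.all(steps < 0, axis=0)    # falls at every step
    total_change = np.abs(mean_acts[-1] - mean_acts[0])
    keep = (increasing | decreasing) & (total_change >= min_range)
    signs = np.where(increasing, 1, -1)[keep]
    return np.flatnonzero(keep), signs


# Toy data (made up): 4 mixtures x 4 features.
toy_acts = np.array([
    [0.10, 0.90, 0.50, 0.30],
    [0.20, 0.70, 0.40, 0.30],
    [0.35, 0.40, 0.60, 0.30],
    [0.60, 0.10, 0.55, 0.30],
])
idx, signs = monotonic_features(toy_acts)
print(idx, signs)
```

In the toy data, feature 0 rises with every mixture, feature 1 falls, feature 2 fluctuates, and feature 3 is constant, so only the first two pass the filter. A rank-correlation criterion (for example Spearman's rho with a threshold) would be a softer alternative to the strict check assumed here.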