When large language models are aligned via supervised fine-tuning, they may encounter new factual information that was not acquired through pre-training. It is often conjectured that this can teach the model the behavior of hallucinating factually incorrect responses, as the model is trained to generate facts that are not grounded in its pre-existing knowledge. In this work, we study the impact of such exposure to new knowledge on the capability of the fine-tuned model to utilize its pre-existing knowledge. To this end, we design a controlled setup, focused on closed-book QA, where we vary the proportion of the fine-tuning examples that introduce new knowledge. We demonstrate that large language models struggle to acquire new factual knowledge through fine-tuning, as fine-tuning examples that introduce new knowledge are learned significantly more slowly than those consistent with the model's knowledge. However, we also find that as the examples with new knowledge are eventually learned, they linearly increase the model's tendency to hallucinate. Taken together, our results highlight the risk of introducing new factual knowledge through fine-tuning, and support the view that large language models mostly acquire factual knowledge through pre-training, whereas fine-tuning teaches them to use it more efficiently.
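To make the controlled setup concrete, below is a minimal sketch (ours, not the authors' released code) of one plausible way to build such fine-tuning mixtures: each closed-book QA pair is labeled Known or Unknown by checking whether the base model reproduces the gold answer under sampling, and training sets are then assembled with a chosen proportion of Unknown (new-knowledge) examples. The `sample_answers` stub, the matching rule, and the specific proportion grid are illustrative assumptions, not details taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str


def is_known(qa, sample_answers, n_samples=16):
    """Label a QA pair Known if the base model reproduces the gold
    answer in any of n_samples sampled generations. This is a
    simplified stand-in for the paper's knowledge categorization."""
    samples = sample_answers(qa.question, n_samples)
    gold = qa.answer.strip().lower()
    return any(s.strip().lower() == gold for s in samples)


def build_mixture(known, unknown, unknown_fraction, size, seed=0):
    """Assemble a fine-tuning set of `size` examples in which an
    `unknown_fraction` share introduces new knowledge."""
    rng = random.Random(seed)
    n_unknown = round(size * unknown_fraction)
    mixture = rng.sample(unknown, n_unknown) + rng.sample(known, size - n_unknown)
    rng.shuffle(mixture)
    return mixture


if __name__ == "__main__":
    # Hypothetical stub: a real experiment would sample from the base model.
    def sample_answers(question, n):
        return ["paris" if "capital of france" in question.lower() else "?"] * n

    pool = [QAPair("What is the capital of France?", "Paris"),
            QAPair("Who won the 1954 Giro d'Italia?", "Carlo Clerici")] * 50
    known = [qa for qa in pool if is_known(qa, sample_answers)]
    unknown = [qa for qa in pool if not is_known(qa, sample_answers)]

    # Illustrative grid of Unknown proportions (the abstract varies this
    # proportion but does not fix specific values here).
    for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
        train = build_mixture(known, unknown, frac, size=40)
        print(frac, sum(qa in unknown for qa in train) / len(train))
```

Tracking how well the model fits the Known versus Unknown subsets over fine-tuning epochs on such mixtures is what would reveal the abstract's central observation: examples introducing new knowledge are fit more slowly, and fitting them correlates with increased hallucination.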