Large language models (LLMs) can store a vast amount of world knowledge, often extractable via question-answering (e.g., "What is Abraham Lincoln's birthday?"). However, do they answer such questions based on exposure to similar questions during training (i.e., cheating), or by genuinely learning to extract knowledge from sources like Wikipedia? In this paper, we investigate this issue using a controlled biography dataset. We find a strong correlation between the model's ability to extract knowledge and various diversity measures of the training data. $\textbf{Essentially}$, for knowledge to be reliably extracted, it must be sufficiently augmented (e.g., through paraphrasing, sentence shuffling) $\textit{during pretraining}$. Without such augmentation, knowledge may be memorized but not extractable, leading to 0% accuracy, regardless of subsequent instruction fine-tuning. To understand why this occurs, we employ (nearly) linear probing to demonstrate a strong connection between the observed correlation and how the model internally encodes knowledge -- whether it is linearly encoded in the hidden embeddings of entity names or distributed across the embeddings of other tokens in the training text. This paper provides $\textbf{several key recommendations for LLM pretraining in industry}$: (1) rewrite the pretraining data -- using small, auxiliary models -- to provide knowledge augmentation, and (2) incorporate more instruction fine-tuning data into the pretraining stage before it becomes too late.
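As an illustration of the sentence-shuffling augmentation mentioned above, the following is a minimal sketch and not the paper's actual pipeline. The biography text, the function names `split_sentences` and `shuffle_augment`, and the naive period-based sentence splitter are all assumptions made for this example; paraphrasing would additionally require a rewriting model and is not shown.

```python
import random

# Hypothetical synthetic biography, made up for this illustration only.
BIO = (
    "Mira Koval was born on October 2, 1996. "
    "She spent her early years in Princeton. "
    "She studied Communications at a large university. "
    "She later worked for a technology company."
)


def split_sentences(text):
    """Naively split text into sentences on periods (assumes no abbreviations)."""
    parts = [p.strip() for p in text.split(".") if p.strip()]
    return [p + "." for p in parts]


def shuffle_augment(bio, n_variants, seed=0):
    """Return n_variants copies of bio, each with its sentences in a shuffled order.

    Every variant keeps exactly the same sentences (and thus the same facts);
    only the order in which they appear changes.
    """
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    sentences = split_sentences(bio)
    variants = []
    for _ in range(n_variants):
        order = sentences[:]   # copy so the original order is preserved
        rng.shuffle(order)
        variants.append(" ".join(order))
    return variants


if __name__ == "__main__":
    for variant in shuffle_augment(BIO, n_variants=3):
        print(variant)
```

Each variant states the same facts in a different order, so the facts are no longer tied to a single fixed position in the training text. This is only one of the diversity measures the abstract refers to.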