Large language models (LLMs) can store a vast amount of world knowledge, often extractable via question-answering (e.g., "What is Abraham Lincoln's birthday?"). However, do they answer such questions based on exposure to similar questions during training (i.e., cheating), or by genuinely learning to extract knowledge from sources like Wikipedia? In this paper, we investigate this issue using a controlled biography dataset. We find a strong correlation between the model's ability to extract knowledge and various diversity measures of the training data. $\textbf{Essentially}$, for knowledge to be reliably extracted, it must be sufficiently augmented (e.g., through paraphrasing, sentence shuffling, translations) $\textit{during pretraining}$. Without such augmentation, knowledge may be memorized but not extractable, leading to 0% accuracy, regardless of subsequent instruction fine-tuning. To understand why this occurs, we employ (nearly) linear probing to demonstrate a strong connection between the observed correlation and how the model internally encodes knowledge -- whether it is linearly encoded in the hidden embeddings of entity names or distributed across other token embeddings in the training text. This paper provides $\textbf{several key recommendations for LLM pretraining in the industry}$: (1) rewrite the pretraining data -- using small, auxiliary models -- to provide knowledge augmentation, and (2) incorporate more instruction-finetuning data into the pretraining stage before it becomes too late.
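The abstract's probing claim can be illustrated with a minimal sketch. The code below is not the paper's actual probe; it uses synthetic vectors in place of real hidden states to show what "(nearly) linear probing" tests: whether a linear classifier can recover an attribute from entity-name embeddings. All names and dimensions are illustrative assumptions.

```python
# Hypothetical sketch of a linear probe on entity-name embeddings.
# Synthetic Gaussian vectors stand in for a model's hidden states;
# "augmented" simulates knowledge linearly encoded in the entity
# embedding, "plain" simulates knowledge not linearly decodable there.
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 64

# Case 1 (augmented pretraining): a binary attribute (e.g., a birth-year
# bucket) is a linear function of the embedding, plus small noise.
w_true = rng.normal(size=d)
X_aug = rng.normal(size=(n, d))
y = (X_aug @ w_true > 0).astype(int)
X_aug = X_aug + 0.1 * rng.normal(size=(n, d))

# Case 2 (no augmentation): embeddings carry no linear signal about y.
X_plain = rng.normal(size=(n, d))

def probe_accuracy(X, y):
    """Fit a least-squares linear probe on half the data,
    report held-out sign-prediction accuracy on the other half."""
    t = 2 * y - 1  # map {0,1} labels to {-1,+1} targets
    w, *_ = np.linalg.lstsq(X[: n // 2], t[: n // 2], rcond=None)
    pred = (X[n // 2 :] @ w > 0).astype(int)
    return float(np.mean(pred == y[n // 2 :]))

acc_aug = probe_accuracy(X_aug, y)
acc_plain = probe_accuracy(X_plain, y)
print(f"probe accuracy (linearly encoded): {acc_aug:.2f}")
print(f"probe accuracy (not encoded):      {acc_plain:.2f}")
```

In the paper's setting, a high probe accuracy on the first case and chance-level accuracy on the second would correspond to extractable versus merely memorized knowledge.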