Despite the recent observation that large language models (LLMs) can store substantial factual knowledge, there is limited understanding of the mechanisms by which they acquire that knowledge through pretraining. This work addresses the gap by studying the dynamics of factual knowledge acquisition during pretraining, and the findings reveal several important insights. First, counterintuitively, we observe that pretraining on more data yields no significant improvement in the model's capability to acquire and maintain factual knowledge. Second, there is a power-law relationship between training steps and the forgetting of both memorization and generalization of factual knowledge, and LLMs trained on duplicated data exhibit faster forgetting. Third, training LLMs with larger batch sizes can enhance their robustness to forgetting. Overall, our observations suggest that factual knowledge acquisition in LLM pretraining occurs through a progressive, per-step increase in the probability assigned to the factual knowledge presented in the pretraining data, an increase that is then diluted by subsequent forgetting. Based on this interpretation, we show that it provides plausible explanations for recently observed LLM behaviors, such as poor performance on long-tail knowledge and the benefits of deduplicating the pretraining corpus.
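To make the acquire-then-forget interpretation concrete, here is a minimal toy simulation (not the paper's code; the functional form, the parameter values, and names such as `gain` and `decay_exponent` are illustrative assumptions). Each encounter with a fact adds an immediate bump to the model's log-probability for that fact, and each bump then decays as a power law in the steps elapsed since the encounter, so a frequently repeated ("head") fact accumulates a lasting gain while a rarely seen ("long-tail") fact does not.

```python
import numpy as np

def simulate_fact_probability(exposure_steps, total_steps,
                              gain=1.0, decay_exponent=0.5):
    """Toy model of factual knowledge acquisition during pretraining.

    Each exposure to a fact adds an immediate log-probability gain,
    which then decays as a power law in the number of steps elapsed
    since that exposure. All functional forms and values here are
    illustrative assumptions, not fits reported in the paper.
    """
    log_prob_gain = np.zeros(total_steps)
    for t in range(total_steps):
        for s in exposure_steps:
            if t >= s:
                # Power-law forgetting: the bump from the exposure at
                # step s has decayed by (t - s + 1) ** -decay_exponent.
                log_prob_gain[t] += gain * (t - s + 1) ** -decay_exponent
    return log_prob_gain

if __name__ == "__main__":
    total = 1000
    # A "head" fact seen every 50 steps vs. a "long-tail" fact seen once.
    head = simulate_fact_probability(range(0, total, 50), total)
    tail = simulate_fact_probability([0], total)
    print(f"final gain, frequent fact:  {head[-1]:.3f}")
    print(f"final gain, long-tail fact: {tail[-1]:.3f}")
```

Under this toy picture, a larger `decay_exponent` (faster forgetting, as observed with duplicated training data) and infrequent exposure both suppress the final gain, which is consistent with the abstract's explanation of long-tail underperformance and the benefit of deduplication.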