Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g., a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110M parameters) to memorize 1.3× more entity facts than standard training, matching the performance of a 10× larger model (1.3B parameters) pretrained on the full dataset.
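To make the two goals of data selection concrete (limiting the number of facts and flattening their frequency distribution), the following is a minimal illustrative sketch, not the selection method proposed in the paper. It assumes fact identities and counts are known in advance, whereas the proposed schemes rely on the training loss alone. The names `zipf_fact_stream` and `select_capped`, the rule of keeping the most frequent facts, and all numeric settings are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def zipf_fact_stream(num_facts, num_samples, exponent=1.0, seed=0):
    """Sample fact IDs whose frequencies follow a power law (Zipf-like)."""
    rng = random.Random(seed)
    # Weight of the fact at rank r is proportional to 1 / r**exponent.
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_facts + 1)]
    return rng.choices(range(num_facts), weights=weights, k=num_samples)


def select_capped(stream, max_facts, max_repeats):
    """Keep at most `max_facts` distinct facts and `max_repeats` copies of each.

    Assumption for illustration: the retained facts are the most frequent
    ones, chosen from oracle counts rather than from the training loss.
    """
    counts = Counter(stream)
    kept_facts = {fact for fact, _ in counts.most_common(max_facts)}
    seen = Counter()
    selected = []
    for fact in stream:
        if fact in kept_facts and seen[fact] < max_repeats:
            seen[fact] += 1
            selected.append(fact)
    return selected


# Hypothetical sizes, chosen only to make the skew visible.
stream = zipf_fact_stream(num_facts=1000, num_samples=20000)
selected = select_capped(stream, max_facts=200, max_repeats=20)

before = Counter(stream)
after = Counter(selected)
print("distinct facts:", len(before), "->", len(after))
print("max occurrences of one fact:", max(before.values()), "->", max(after.values()))
```

Capping both the number of distinct facts and the repetitions per fact yields a smaller, flatter training stream, which is the property the abstract associates with fact accuracy closer to the capacity limit.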