The GPT-4 technical report suggests that downstream performance can be predicted from pre-training signals, but it offers little methodological detail on how to quantify this. This work addresses that gap by modeling knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduces a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy. We validate SMI through large-scale document retrieval over the disclosed pre-training corpora of 21 public and 3 custom models, combined with a robust multi-template QA evaluation. Experiments show that SMI significantly outperforms repetition-based baselines and achieves $R^2 > 0.7$ in predicting the QA accuracy of models with more than 1B parameters, without any additional training. The analysis further reveals diminishing returns from scaling data and model size, and provides evidence of an intrinsic upper bound on the knowledge retention achievable by pre-training alone, motivating retrieval and other augmentation strategies. The dataset and code are available at https://github.com/yuhui1038/SMI.
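As a rough illustration of the prediction setup described above, the following minimal sketch fits a predictor built from knowledge frequency, knowledge specificity, and model size to synthetic QA-accuracy data and scores it with held-out $R^2$. The log-feature construction, the logistic link, and all variable names are hypothetical placeholders chosen for the sketch; they are not the paper's actual SMI formula.

```python
# Hypothetical sketch of frequency/specificity/size-based accuracy prediction.
# The functional form below is an illustrative assumption, NOT the SMI definition.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: per-fact corpus frequency, a specificity proxy, model size (params),
# and an observed closed-book QA accuracy for each (fact, model) pair.
rng = np.random.default_rng(0)
n = 200
freq = rng.integers(1, 10_000, n)        # knowledge frequency (corpus occurrences)
spec = rng.uniform(0.0, 1.0, n)          # knowledge specificity proxy in [0, 1]
size = rng.choice([1e9, 7e9, 70e9], n)   # model parameter count

# Assumed monotone features: more repetitions, higher specificity, and larger
# models all raise retention, with diminishing returns via log scaling.
X = np.column_stack([np.log(freq), spec, np.log(size), np.ones(n)])
true_w = np.array([0.8, 1.5, 0.4, -8.0])             # synthetic ground truth
acc = sigmoid(X @ true_w) + rng.normal(0, 0.02, n)   # synthetic QA accuracy

# Least-squares fit in the logit domain, then R^2 on held-out pairs.
p = np.clip(acc, 1e-3, 1 - 1e-3)
logit = np.log(p / (1 - p))
train, test = slice(0, 150), slice(150, n)
w, *_ = np.linalg.lstsq(X[train], logit[train], rcond=None)
pred = sigmoid(X[test] @ w)
ss_res = np.sum((acc[test] - pred) ** 2)
ss_tot = np.sum((acc[test] - acc[test].mean()) ** 2)
print(f"held-out R^2 = {1 - ss_res / ss_tot:.3f}")
```

In this toy setting the fitted weights recover the diminishing-returns behavior the abstract describes, since both frequency and model size enter only through their logarithms.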