Pretraining is the preliminary and fundamental step in developing capable language models (LMs). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we pretrain 28 1.5B-parameter decoder-only models, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we quantify the effect of pretraining data age. A temporal shift between evaluation data and pretraining data leads to performance degradation, which is not overcome by finetuning. Second, we explore the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and risk of toxic generations. Our findings indicate that there is no one-size-fits-all solution to filtering training data. We also find that the effects of different types of filtering are not predictable from text domain characteristics. Lastly, we empirically validate that the inclusion of heterogeneous data sources, like books and web, is broadly beneficial and warrants greater prioritization. These findings constitute the largest set of experiments to validate, quantify, and expose many undocumented intuitions about text pretraining, which we hope will help support more informed data-centric decisions in LM development.