Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks, owing to the scale and diversity of available pretraining data. The standard training recipe is a two-stage paradigm: pretraining on the full corpus, followed by specialization on a high-quality, specialized subset of that corpus. In the multi-domain setting, this means continued pretraining of multiple models, each on a specialized domain, which is referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus and for determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model with N parameters trained on D pretraining tokens and D' specialization tokens, and it extrapolates to larger model sizes and token counts. Applied to language model training, our approach consistently improves performance on common-sense knowledge and reasoning benchmarks across model sizes and compute budgets.