Midtraining, the practice of mixing specialized data with more general pretraining data in an intermediate training phase, has become widespread in language model development, yet little is understood about what makes it effective. We propose that midtraining functions as distributional bridging: it provides a better initialization for posttraining. In controlled pretraining experiments, we find that midtraining benefits are largest for domains distant from general pretraining data, such as code and math, and that they scale with the proximity advantage the midtraining data provides toward the target distribution. In these domains, midtraining consistently outperforms continued pretraining on specialized data alone, both in in-domain performance and in mitigating forgetting. Using code as a case study, we further investigate when midtraining data should be introduced and at what mixture weight, and find that the two interact strongly: early introduction of specialized data tolerates high mixture weights, while late introduction requires lower ones. This suggests that introducing specialized data late, outside a plasticity window, cannot be compensated for by a larger mixture weight later in training. Beyond midtraining itself, these findings suggest that distributional transitions between any training phases may benefit from similar bridging strategies.
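To make the two knobs studied here concrete, the following is a minimal sketch of a two-stream midtraining sampler, where "starting time" is the fraction of training after which specialized data enters the mix and "mixture weight" is the probability of drawing from the specialized stream. All names and values are illustrative assumptions, not the paper's experimental setup.

```python
import random

def sample_source(step: int, total_steps: int,
                  start_frac: float = 0.5, mix_weight: float = 0.3) -> str:
    """Toy midtraining schedule (hypothetical parameters).

    Before `start_frac` of training, draw only from the general
    pretraining stream; afterwards, draw from the specialized stream
    with probability `mix_weight`.
    """
    if step / total_steps < start_frac:
        return "general"
    return "specialized" if random.random() < mix_weight else "general"

# Under the paper's finding, early introduction (e.g. start_frac=0.2)
# tolerates a high mix_weight, while late introduction (e.g.
# start_frac=0.8) requires a lower one.
```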