Continual Pre-training (CPT) serves as a fundamental approach for adapting foundation models to domain-specific applications. Scaling laws for pre-training describe a power-law relationship between dataset size and the test loss of an LLM. However, the marginal gains from simply increasing data for CPT diminish rapidly, yielding suboptimal data utilization and inefficient training. To address this challenge, we propose a novel perplexity-aware data scaling law that establishes a predictive relationship between the perplexity landscape of domain-specific data and the test loss. Our approach leverages the perplexity of domain data under the pre-trained model as a proxy for the knowledge gap, effectively quantifying the informational value of candidate training samples. By fitting this scaling law across diverse perplexity regimes, we enable adaptive selection of high-utility data subsets, prioritizing content that maximizes knowledge absorption while minimizing redundancy and noise. Extensive experiments demonstrate that our method consistently identifies near-optimal training subsets and achieves superior performance on both medical and general-domain benchmarks.
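To make the pipeline concrete, below is a minimal sketch, not the paper's released implementation, of the two ingredients the abstract describes: scoring candidate domain samples by their perplexity under the pre-trained model, bucketing them into perplexity regimes, and fitting a power law between subset size and test loss within a regime. The model name (`gpt2`), the bucket edges, and the loss measurements are illustrative placeholders, and the fit uses a simplified two-parameter form `L(N) ≈ a · N^(-b)` without an irreducible-loss term.

```python
# Sketch: perplexity-based bucketing of domain data plus a per-regime power-law fit.
# Model name, bucket edges, and the example (N, loss) pairs are assumptions for
# illustration only, not values from the paper.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; swap in the base LLM being adapted via CPT

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def sample_perplexity(text: str) -> float:
    """Perplexity of one candidate sample under the pre-trained model."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return float(torch.exp(loss))


def bucket_by_perplexity(samples, edges=(0.0, 10.0, 30.0, 100.0, float("inf"))):
    """Group candidate samples into perplexity regimes (edges are illustrative)."""
    buckets = {i: [] for i in range(len(edges) - 1)}
    for text in samples:
        ppl = sample_perplexity(text)
        for i in range(len(edges) - 1):
            if edges[i] <= ppl < edges[i + 1]:
                buckets[i].append((text, ppl))
                break
    return buckets


def fit_power_law(ns, losses):
    """Fit log L = log a - b * log N, i.e. L(N) ≈ a * N^(-b), for one regime."""
    slope, log_a = np.polyfit(np.log(ns), np.log(losses), 1)
    return float(np.exp(log_a)), float(-slope)


# Example: hypothetical test losses after CPT on subsets of sizes `ns` drawn from
# one perplexity regime (numbers are made up purely to show the fitting call).
ns = np.array([1e6, 3e6, 1e7, 3e7])
losses = np.array([2.10, 2.02, 1.95, 1.91])
a, b = fit_power_law(ns, losses)
print(f"fitted L(N) ≈ {a:.3f} * N^(-{b:.4f})")
```

Under this setup, one would fit the law separately per regime and allocate the CPT data budget toward regimes whose fitted curves predict the largest loss reduction per added token; the regime boundaries and the choice of fitting form are design choices left open by the abstract.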