Scaling laws for language model training traditionally characterize how performance scales with model size and dataset volume. Prior work has explored architecture variants and data treatments such as dataset filtering and noise injection in language model pretraining; however, these studies have not formalized data quality within a principled scaling law. We introduce a dimensionless data-quality parameter Q and propose a quality-aware scaling law that extends the Chinchilla framework to predict loss as a joint function of model size, data volume, and data quality. The law is motivated by an effective-sample-size and information-theoretic view of noisy or redundant corpora, and it admits two practical estimators for Q: (i) a corruption-rate proxy and (ii) a deficiency measure. Through synthetic experiments in neural machine translation and autoregressive modeling, in which we systematically control data quality via multiple levels of noise injection, we show that loss scales predictably with data quality and that higher-quality data can substantially reduce the model size, and hence the compute, required to reach a given loss. Our results demonstrate a sublinear decay of effective data with quality and robustness to moderate data corruption; out-of-sample evaluations further validate the predictive form of the law. Unlike prior empirical analyses, our work establishes an explicit, generalizable law for data quality, offering concrete guidance for balancing data-curation effort against model scale in large-scale pretraining.
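To make the shape of such a law concrete, the following is a minimal illustrative sketch rather than the paper's stated parameterization: it assumes the effective-sample-size view is realized by discounting the data volume D by a power of the quality parameter Q inside the standard Chinchilla loss, with a hypothetical quality exponent \gamma.

\[
L(N, D, Q) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{\left(Q^{\gamma} D\right)^{\beta}},
\qquad
D_{\mathrm{eff}} := Q^{\gamma} D .
\]

Under this assumed form, Q = 1 recovers the original Chinchilla law, and an exponent 0 < \gamma < 1 would correspond to the sublinear decay of effective data with quality described above.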