As AI labs approach a data ceiling where compute capacity outpaces the rate at which new high-quality text is produced, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction ($x_{t+i}$ for $i > 1$). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution to the data-constrained regime. All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining
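The three augmentation categories named in the abstract can be illustrated on a sequence of token ids. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the special token ids (`MASK_ID`, `FIM_PRE`, `FIM_SUF`, `FIM_MID`), the prefix-suffix-middle layout for Fill-in-the-Middle, and the choice to corrupt only the inputs while keeping clean targets are all hypothetical. The linked repository is the reference for the actual method.

```python
import random

# Hypothetical special token ids; a real tokenizer would define its own.
MASK_ID, FIM_PRE, FIM_SUF, FIM_MID = 0, 1, 2, 3


def mask_tokens(tokens, p, rng):
    """Token-level noise: replace each token with MASK_ID with probability p."""
    return [MASK_ID if rng.random() < p else t for t in tokens]


def random_replace(tokens, p, vocab_size, rng):
    """Token-level noise: replace each token with a uniformly sampled id
    with probability p. For simplicity this may also sample special ids."""
    return [rng.randrange(vocab_size) if rng.random() < p else t for t in tokens]


def right_to_left(tokens):
    """Sequence permutation: reverse the sequence so that ordinary
    next-token prediction runs right to left over the original text."""
    return list(reversed(tokens))


def fill_in_the_middle(tokens, rng):
    """Sequence permutation: cut at two random points and move the middle
    span to the end (prefix, suffix, middle), separated by sentinel ids."""
    tokens = list(tokens)
    i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
    return [FIM_PRE] + tokens[:i] + [FIM_SUF] + tokens[j:] + [FIM_MID] + tokens[i:j]


def offset_pairs(tokens, offset=1):
    """Target offset prediction: the target at position t is x_{t+offset}.
    offset=1 is standard next-token prediction; offset > 1 looks further ahead."""
    tokens = list(tokens)
    if not 1 <= offset < len(tokens):
        raise ValueError("offset must satisfy 1 <= offset < len(tokens)")
    return tokens[:-offset], tokens[offset:]


# Example: combine two categories on one toy sequence.
rng = random.Random(0)
seq = [10, 11, 12, 13, 14, 15]
inputs, targets = offset_pairs(seq, offset=2)
# Assumption: noise is applied to the inputs only; targets stay clean.
noisy_inputs = random_replace(inputs, p=0.15, vocab_size=100, rng=rng)
```

Because each function maps a token list to a token list (or to an input/target pair), the categories compose by simple chaining, which is one plausible way to read the abstract's claim that combining categories lowers the minimum validation loss further.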