Large-scale pretraining datasets drive the success of large language models (LLMs). However, these web-scale corpora inevitably contain large amounts of noisy data due to unregulated web content or inherent randomness in the data. Although LLM pretrainers often speculate that such noise contributes to instabilities in large-scale LLM pretraining and, in the worst cases, loss divergence, this phenomenon remains poorly understood. In this work, we present a systematic empirical study of whether and how noisy data causes LLM pretraining divergences. By injecting controlled synthetic uniformly random noise into otherwise clean datasets, we analyze training dynamics across model sizes ranging from 480M to 5.2B parameters. We show that noisy data indeed induces training loss divergence, and that the probability of divergence depends strongly on the noise type, the amount of noise, and the model scale. We further find that noise-induced divergences exhibit activation patterns distinct from those caused by high learning rates, and we provide diagnostics that differentiate these two failure modes. Together, these results provide a large-scale, controlled characterization of how noisy data affects loss divergence in LLM pretraining.
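To make the noise-injection setup concrete, the following is a minimal illustrative sketch, not the paper's actual pipeline: it replaces a fraction of token IDs in a pretraining batch with tokens drawn uniformly at random from the vocabulary. The function name and the parameters `noise_fraction`, `vocab_size`, and `seed` are assumptions introduced for this example.

```python
import numpy as np

def inject_uniform_token_noise(token_ids: np.ndarray,
                               noise_fraction: float,
                               vocab_size: int,
                               seed: int = 0) -> np.ndarray:
    """Hypothetical sketch: corrupt a fraction of tokens with uniform random IDs.

    Illustrative stand-in for 'injecting controlled synthetic uniformly random
    noise' into otherwise clean data; the paper's actual procedure may differ.
    """
    rng = np.random.default_rng(seed)
    noisy = token_ids.copy()
    # Select which positions to corrupt, independently per token.
    mask = rng.random(token_ids.shape) < noise_fraction
    # Replace selected positions with token IDs drawn uniformly from the vocabulary.
    noisy[mask] = rng.integers(0, vocab_size, size=int(mask.sum()))
    return noisy

# Example: corrupt roughly 5% of a toy batch of token IDs.
batch = np.arange(32).reshape(2, 16)
noisy_batch = inject_uniform_token_noise(batch, noise_fraction=0.05, vocab_size=50_000)
```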