Maximizing the likelihood of the next token is an established, statistically sound objective for pre-training language models. In this paper we show that we can train better models faster by pre-aggregating the corpus into a collapsed $n$-gram distribution. Previous studies have proposed corpus-level $n$-gram statistics as a regularizer; constructed and queried naively, however, such $n$-grams are costly and significantly slow down training, which limits their use in modern large language model pre-training. We introduce a compact alternative representation of the next-token distribution that matches the full $n$-gram distribution in expectation while markedly reducing variance across mini-batches relative to the standard next-token loss. Empirically, we demonstrate that both the $n$-gram-regularized model and our approximation yield substantial improvements in model quality and convergence rate over existing methods. Furthermore, our approximation lets these gains scale to larger datasets and models than the straightforward $n$-gram regularization method.
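To make the idea concrete, below is a minimal sketch of corpus-level $n$-gram regularization (here with bigrams, in PyTorch). The function names, the mixing weight `lam`, and the dense bigram table are illustrative assumptions rather than the paper's implementation, and the compact low-variance approximation described above is not shown; the sketch only depicts the baseline regularizer that the paper builds on.

```python
import torch
import torch.nn.functional as F

def build_bigram_table(token_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Pre-aggregate corpus-level bigram counts once, before training starts."""
    counts = torch.zeros(vocab_size, vocab_size)
    prev, nxt = token_ids[:-1], token_ids[1:]
    counts.index_put_((prev, nxt), torch.ones(prev.numel()), accumulate=True)
    # Row-normalize into a next-token distribution p(x_t | x_{t-1}).
    return counts / counts.sum(dim=1, keepdim=True).clamp(min=1.0)

def regularized_loss(logits, targets, prev_tokens, bigram_table, lam=0.1):
    """Mix the standard next-token cross-entropy with a cross-entropy
    against the pre-aggregated bigram distribution (the regularizer)."""
    ce = F.cross_entropy(logits, targets)                 # standard next-token loss
    soft_targets = bigram_table[prev_tokens]              # (batch, vocab) soft labels
    log_probs = F.log_softmax(logits, dim=-1)
    ngram_ce = -(soft_targets * log_probs).sum(dim=-1).mean()
    return (1.0 - lam) * ce + lam * ngram_ce
```

In this sketch, `build_bigram_table(corpus_ids, vocab_size)` is computed once over the whole corpus; each training step then passes the mini-batch logits, the gold next tokens, and the preceding tokens to `regularized_loss`. The dense vocabulary-by-vocabulary table is only workable for small vocabularies, which is exactly the naive construction-and-querying cost the abstract identifies and that the proposed compact representation is meant to avoid.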