Maximizing the likelihood of the next token is an established, statistically sound objective for pre-training language models. In this paper we show that we can train better models faster by pre-aggregating the corpus into a collapsed $n$-gram distribution. Previous studies have proposed corpus-level $n$-gram statistics as a regularizer; constructed and queried naively, however, such $n$-grams are costly and significantly slow down training, which limits their use in modern large language model pre-training. We introduce a compact alternative representation of the next-token distribution that matches the full $n$-gram distribution in expectation while markedly reducing variance across mini-batches relative to the standard next-token loss. Empirically, we demonstrate that both the $n$-gram-regularized model and our approximation yield substantial improvements in model quality and convergence rate over existing methods. Furthermore, our approximation lets these gains scale to larger datasets and models than the straightforward $n$-gram regularization method.
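To make the idea concrete, below is a minimal sketch of corpus-level $n$-gram regularization (here with bigrams, in PyTorch). The function names, the mixing weight `lam`, and the dense bigram table are illustrative assumptions rather than the paper's implementation, and the compact low-variance approximation described above is not shown; the sketch only depicts the baseline regularizer that the paper builds on.

```python
import torch
import torch.nn.functional as F

def build_bigram_table(token_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Pre-aggregate corpus-level bigram counts once, before training starts."""
    counts = torch.zeros(vocab_size, vocab_size)
    prev, nxt = token_ids[:-1], token_ids[1:]
    counts.index_put_((prev, nxt), torch.ones(prev.numel()), accumulate=True)
    # Row-normalize into a next-token distribution p(x_t | x_{t-1}).
    return counts / counts.sum(dim=1, keepdim=True).clamp(min=1.0)

def regularized_loss(logits, targets, prev_tokens, bigram_table, lam=0.1):
    """Mix the standard next-token cross-entropy with a cross-entropy
    against the pre-aggregated bigram distribution (the regularizer)."""
    ce = F.cross_entropy(logits, targets)                 # standard next-token loss
    soft_targets = bigram_table[prev_tokens]              # (batch, vocab) soft labels
    log_probs = F.log_softmax(logits, dim=-1)
    ngram_ce = -(soft_targets * log_probs).sum(dim=-1).mean()
    return (1.0 - lam) * ce + lam * ngram_ce
```

In this sketch, `build_bigram_table(corpus_ids, vocab_size)` is computed once over the whole corpus; each training step then passes the mini-batch logits, the gold next tokens, and the preceding tokens to `regularized_loss`. The dense vocabulary-by-vocabulary table is only workable for small vocabularies, which is exactly the naive construction-and-querying cost the abstract identifies and that the proposed compact representation is meant to avoid.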