Generative language models are usually pretrained on large text corpora by predicting the next token (i.e., sub-word/word/phrase) given the previous ones. Recent works have demonstrated the impressive performance of large generative language models on downstream tasks. However, existing generative language models generally neglect an inherent challenge in text corpora during training, i.e., the imbalance between frequent tokens and infrequent ones. This imbalance can cause a language model to be dominated by common and easy-to-learn tokens, thereby overlooking the infrequent and difficult-to-learn ones. To alleviate this, we propose a MiLe Loss function for mitigating the bias in learning difficulty across tokens. During training, it dynamically assesses the learning difficulty of a to-be-learned token according to the information entropy of the corresponding predicted probability distribution over the vocabulary. It then scales the training loss adaptively, guiding the model to focus more on the difficult-to-learn tokens. On the Pile dataset, we train generative language models at three scales: 468M, 1.2B, and 6.7B parameters. Experiments reveal that models incorporating the proposed MiLe Loss gain consistent performance improvements on downstream benchmarks.
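The entropy-based scaling described above can be sketched for a single token as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the function name `mile_style_loss` and the hyperparameter `gamma` are assumptions introduced here, and the precise scaling function used by MiLe Loss may differ.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mile_style_loss(logits, target, gamma=1.0):
    """Entropy-weighted cross-entropy for one token (illustrative only).

    The per-token cross-entropy is scaled by the entropy of the predicted
    distribution raised to the power `gamma`: a confident (low-entropy)
    prediction on an easy token is down-weighted, while a diffuse
    (high-entropy) prediction on a hard token is up-weighted.
    `gamma` is a hypothetical knob, not taken from the paper.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return (entropy ** gamma) * cross_entropy
```

With this weighting, a token the model already predicts confidently contributes a small loss, while an uncertain prediction is amplified, shifting gradient signal toward the infrequent, difficult-to-learn tokens.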