Generative language models are usually pretrained on large text corpora by predicting the next token (i.e., a sub-word, word, or phrase) given the preceding ones. Recent work has demonstrated the impressive performance of large generative language models on downstream tasks. However, existing generative language models generally neglect an inherent challenge of text corpora during training: the imbalance between frequent and infrequent tokens. This imbalance can cause a language model to be dominated by common, easy-to-learn tokens and to overlook infrequent, difficult-to-learn ones. To alleviate this, we propose an Information Entropy Loss (InfoEntropy Loss) function. During training, it dynamically assesses the learning difficulty of a to-be-learned token according to the information entropy of the corresponding predicted probability distribution over the vocabulary. It then scales the training loss adaptively, encouraging the model to focus more on the difficult-to-learn tokens. On the Pile dataset, we train generative language models at three scales: 468M, 1.2B, and 6.7B parameters. Experiments show that models incorporating the proposed InfoEntropy Loss achieve consistent performance improvements on downstream benchmarks.
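To make the entropy-based scaling concrete, the following is a minimal sketch for a single token, not the authors' implementation. The abstract does not give the exact form of the scaling, so the factor `entropy ** gamma`, the exponent `gamma`, and the function names `log_softmax`, `token_entropy`, and `entropy_scaled_loss` are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities over the vocabulary."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_entropy(logits):
    """Information entropy (in nats) of the predicted distribution."""
    log_probs = log_softmax(logits)
    return -sum(math.exp(lp) * lp for lp in log_probs)


def entropy_scaled_loss(logits, target, gamma=1.0):
    """Cross-entropy for one token, scaled by entropy ** gamma.

    ASSUMPTION: the scaling form and the hyperparameter `gamma` are
    hypothetical; the abstract only states that the loss is scaled
    adaptively according to the entropy of the prediction.
    """
    cross_entropy = -log_softmax(logits)[target]
    weight = token_entropy(logits) ** gamma
    return weight * cross_entropy


# A peaked (confident) prediction versus a nearly flat (uncertain) one,
# over a toy vocabulary of four tokens with the target at index 0.
confident = [8.0, 0.0, 0.0, 0.0]
uncertain = [0.5, 0.0, 0.0, 0.0]

loss_confident = entropy_scaled_loss(confident, target=0)
loss_uncertain = entropy_scaled_loss(uncertain, target=0)
```

Under this assumed form, a token whose predicted distribution is nearly flat receives a larger weight than one predicted with high confidence, and setting `gamma=0` recovers the ordinary cross-entropy loss.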