Generative language models are usually pretrained on large text corpora by predicting the next token (i.e., a sub-word, word, or phrase) given the preceding ones. Recent work has demonstrated the impressive performance of large generative language models on downstream tasks. However, existing generative language models generally neglect an inherent challenge of text corpora during training: the imbalance between frequent and infrequent tokens. This imbalance can lead a language model to be dominated by common, easy-to-learn tokens and thereby to overlook infrequent, difficult-to-learn ones. To alleviate this, we propose an Information Entropy Loss (InfoEntropy Loss) function. During training, it dynamically assesses the learning difficulty of a to-be-learned token from the information entropy of the corresponding predicted probability distribution over the vocabulary. It then scales the training loss adaptively, steering the model to focus more on the difficult-to-learn tokens. On the Pile dataset, we train generative language models at three scales: 436M, 1.1B, and 6.7B parameters. Experiments show that models trained with the proposed InfoEntropy Loss achieve consistent performance improvements on downstream benchmarks.
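The entropy-based scaling described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the per-token cross-entropy is multiplied by the entropy of the predicted distribution raised to a hypothetical exponent `gamma`, and the function names are invented here. It is not presented as the authors' definition of InfoEntropy Loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities for one token's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def entropy_weighted_loss(logits_per_token, targets, gamma=1.0):
    """Mean cross-entropy, with each token's loss scaled by the entropy
    of its predicted distribution (assumed form, not the paper's formula).

    logits_per_token: list of per-token logit lists over the vocabulary
    targets: list of target token indices, one per token
    gamma: hypothetical exponent controlling how strongly entropy scales the loss
    """
    total = 0.0
    for logits, target in zip(logits_per_token, targets):
        log_probs = log_softmax(logits)
        # Information entropy H(p) = -sum_j p_j * log p_j of the prediction.
        entropy = -sum(math.exp(lp) * lp for lp in log_probs)
        cross_entropy = -log_probs[target]
        # High entropy (uncertain prediction) marks a harder token,
        # so its loss receives a larger weight.
        total += (entropy ** gamma) * cross_entropy
    return total / len(targets)


# Two tokens over a 4-word vocabulary: one uncertain, one confident prediction.
uncertain = entropy_weighted_loss([[0.0, 0.0, 0.0, 0.0]], [0])
confident = entropy_weighted_loss([[10.0, 0.0, 0.0, 0.0]], [0])
print(uncertain > confident)
```

Under this assumed form, `gamma=0` recovers plain cross-entropy, which makes the scaling easy to ablate. In an automatic-differentiation framework, whether the entropy weight is treated as a constant or receives gradients is a further design choice that the abstract does not specify.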