Neural language models are probabilistic models of human text. They are predominantly trained using maximum likelihood estimation (MLE), which is equivalent to minimizing the forward cross-entropy between the empirical data distribution and the model distribution. However, various degeneration phenomena are still widely observed when decoding from the distributions learned by such models. We establish that the forward cross-entropy is suboptimal as a distance metric for aligning the human and model distributions due to its (1) recall-prioritization, (2) negative diversity ignorance, and (3) train-test mismatch. In this paper, we propose Earth Mover Distance Optimization (EMO) for auto-regressive language modeling. EMO capitalizes on the inherent properties of earth mover distance to address the aforementioned challenges. Because computing the earth mover distance directly is prohibitively expensive, we further introduce a feasible upper bound for EMO to ease end-to-end training. Upon extensive evaluation of language models trained using EMO and MLE, we find that EMO demonstrates consistently better language modeling performance than MLE across domains. Moreover, EMO demonstrates noteworthy enhancements in downstream performance with minimal fine-tuning on merely 25,000 sentences. This highlights the tremendous potential of EMO as a lightweight calibration method for enhancing large-scale pre-trained language models.
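To make the contrast between the two objectives concrete, the following is a minimal sketch for a single next-token prediction. It is not the upper bound derived in the paper: the three-token vocabulary, the cost matrix `cost`, and the function names are all illustrative assumptions. It relies on one general fact: the transport cost of any valid coupling of two distributions upper-bounds their earth mover distance, and the product coupling `outer(p, q)` is always valid. When the reference distribution is one-hot, that coupling is the only one, so the bound is exact.

```python
import numpy as np


def forward_cross_entropy(p, q, eps=1e-12):
    """Forward cross-entropy H(p, q) = -sum_i p_i * log(q_i)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q + eps)))


def product_coupling_cost(p, q, cost):
    """Transport cost sum_ij p_i * q_j * cost_ij under the coupling outer(p, q).

    outer(p, q) has marginals p and q, so this value upper-bounds the
    earth mover distance between p and q for the given cost matrix.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(p @ np.asarray(cost, dtype=float) @ q)


# Hypothetical ground cost between 3 tokens: token 1 is a near-synonym of
# token 0, token 2 is unrelated to both. These values are made up.
cost = np.array([
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])

p = np.array([1.0, 0.0, 0.0])       # one-hot reference: token 0 is correct
q_near = np.array([0.6, 0.4, 0.0])  # leftover mass on the near-synonym
q_far = np.array([0.6, 0.0, 0.4])   # leftover mass on the unrelated token

# Cross-entropy only sees q[0], so both model distributions score the same.
ce_near = forward_cross_entropy(p, q_near)
ce_far = forward_cross_entropy(p, q_far)

# The transport cost depends on where the leftover mass went.
emd_near = product_coupling_cost(p, q_near, cost)  # 0.4 * 0.1
emd_far = product_coupling_cost(p, q_far, cost)    # 0.4 * 1.0

print(f"cross-entropy:  near={ce_near:.4f}  far={ce_far:.4f}")
print(f"transport cost: near={emd_near:.4f}  far={emd_far:.4f}")
```

In this toy setting the cross-entropy is identical for `q_near` and `q_far`, which is one way to read the "negative diversity ignorance" named above, while the transport cost penalizes mass placed on the unrelated token ten times more heavily than mass placed on the near-synonym.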