Neural language models are probabilistic models of human text. They are predominantly trained using maximum likelihood estimation (MLE), which is equivalent to minimizing the forward cross-entropy between the empirical data distribution and the model distribution. However, various degeneration phenomena are still widely observed when decoding from the distributions learned by such models. We establish that the forward cross-entropy is suboptimal as a distance metric for aligning the human and model distributions due to its (1) recall-prioritization, (2) negative diversity ignorance, and (3) train-test mismatch. In this paper, we propose Earth Mover Distance Optimization (EMO) for auto-regressive language modeling. EMO capitalizes on the inherent properties of the earth mover distance to address the aforementioned challenges. Because direct computation is prohibitively complex, we further introduce a feasible upper bound for EMO to ease end-to-end training. Upon extensive evaluation of language models trained using EMO and MLE, we find that EMO demonstrates consistently better language modeling performance than MLE across domains. Moreover, EMO yields noteworthy improvements in downstream performance with minimal fine-tuning on merely 25,000 sentences. This highlights the tremendous potential of EMO as a lightweight calibration method for enhancing large-scale pre-trained language models.
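To make the contrast between the two objectives concrete, the following is a minimal sketch under stated assumptions, not the training objective or upper bound used in the paper. It considers a single prediction step where the reference distribution is a point mass on one target token. In that special case the earth mover distance has a closed form: all of the model's probability mass must be transported to the target, so the distance is the expected transport cost. The 4-token vocabulary, the hand-picked `cost` matrix, and the helper names `forward_cross_entropy` and `emd_to_point_mass` are all hypothetical and chosen only for illustration. The example reflects our reading of "negative diversity ignorance": forward cross-entropy depends only on the probability assigned to the target, so it cannot distinguish a near-synonym from an unrelated token among the alternatives.

```python
import numpy as np


def forward_cross_entropy(target_idx, q):
    """Forward cross-entropy against a one-hot target: -log q[target]."""
    return float(-np.log(q[target_idx]))


def emd_to_point_mass(target_idx, q, cost):
    """Earth mover distance between q and a point mass on target_idx.

    With a point-mass reference there is only one feasible transport plan
    (move every unit of q's mass to the target), so the distance reduces
    to the expected cost sum_i q[i] * cost[i, target_idx].
    """
    return float(q @ cost[:, target_idx])


# Hypothetical vocabulary and pairwise token cost (0 = identical, 1 = unrelated).
vocab = ["cat", "kitten", "dog", "car"]
cost = np.array([
    [0.0, 0.1, 0.5, 1.0],
    [0.1, 0.0, 0.5, 1.0],
    [0.5, 0.5, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
])

target = vocab.index("cat")

# Both model distributions put 0.6 on the target; they differ only in
# where the remaining 0.4 goes.
q_near = np.array([0.6, 0.4, 0.0, 0.0])  # remainder on "kitten"
q_far = np.array([0.6, 0.0, 0.0, 0.4])   # remainder on "car"

ce_near = forward_cross_entropy(target, q_near)
ce_far = forward_cross_entropy(target, q_far)
emd_near = emd_to_point_mass(target, q_near, cost)
emd_far = emd_to_point_mass(target, q_far, cost)

print(f"cross-entropy: near={ce_near:.4f} far={ce_far:.4f}")
print(f"earth mover:   near={emd_near:.4f} far={emd_far:.4f}")
```

In this toy setting the two distributions receive the same cross-entropy, while the earth mover distance penalizes mass placed on "car" more heavily than mass placed on "kitten". For general (non-point-mass) reference distributions the distance requires solving an optimal transport problem, which is the cost that motivates a tractable upper bound.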