Neural language models are probabilistic models of human text. They are predominantly trained using maximum likelihood estimation (MLE), which is equivalent to minimizing the forward cross-entropy between the empirical data distribution and the model distribution. However, various degeneration phenomena are still widely observed when decoding from the distributions learned by such models. We establish that the forward cross-entropy is suboptimal as a distance metric for aligning human and model distributions due to its (1) recall-prioritization, (2) negative diversity ignorance, and (3) train-test mismatch. In this paper, we propose Earth Mover Distance Optimization (EMO) for auto-regressive language modeling. EMO capitalizes on the inherent properties of earth mover distance to address the aforementioned challenges. Because computing the earth mover distance directly is costly, we further introduce a tractable upper bound for EMO to ease end-to-end training. Upon extensive evaluation of language models trained using EMO and MLE, we find that EMO consistently outperforms MLE in language modeling across domains. Moreover, EMO yields noteworthy improvements in downstream performance after minimal fine-tuning on merely 25,000 sentences. This highlights the tremendous potential of EMO as a lightweight calibration method for enhancing large-scale pre-trained language models.
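To make the idea concrete: when the data-side distribution at each step is the one-hot indicator of the gold next token, the earth mover distance against the model's next-token distribution reduces to the expected transport cost under the model distribution, which motivates a simple per-step training objective. The PyTorch sketch below illustrates this form; the function name `emo_upper_bound_loss` and the particular cost design (one minus cosine similarity between token embeddings) are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def emo_upper_bound_loss(logits, targets, embeddings):
    """Sketch of an EMD-style per-step loss against one-hot targets.

    logits:     (batch, vocab) unnormalized scores from the LM head
    targets:    (batch,) gold next-token ids
    embeddings: (vocab, dim) token embedding matrix defining the
                transport cost between tokens (assumed cost design)

    With a one-hot target, all model probability mass must be moved to
    the gold token, so the earth mover distance collapses to the
    expected cost 1 - cos(e_i, e_y) under the model distribution.
    """
    probs = F.softmax(logits, dim=-1)           # model distribution, (batch, vocab)
    emb = F.normalize(embeddings, dim=-1)       # unit-norm embeddings
    tgt_emb = emb[targets]                      # gold-token embeddings, (batch, dim)
    cost = 1.0 - tgt_emb @ emb.t()              # transport cost, (batch, vocab)
    return (probs * cost).sum(dim=-1).mean()    # expected transport cost
```

In a lightweight calibration setting such as the 25,000-sentence fine-tuning described above, one natural design choice would be to optimize this term in place of (or interpolated with) the standard MLE loss; the interpolation itself is a usage assumption, not a claim about the paper's recipe.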