Autoregressive language models are trained by minimizing the cross-entropy of the model distribution Q relative to the data distribution P, that is, by minimizing the forward cross-entropy, which is equivalent to maximum likelihood estimation (MLE). We have observed that models trained in this way may "over-generalize", in the sense that they produce non-human-like text. Moreover, we believe that reverse cross-entropy, i.e., the cross-entropy of P relative to Q, is a better reflection of how a human would evaluate text generated by a model. Hence, we propose learning with MixCE, an objective that mixes the forward and reverse cross-entropies. We evaluate models trained with this objective on synthetic data settings (where P is known) and real data, and show that the resulting models yield better generated text without complex decoding strategies. Our code and models are publicly available at https://github.com/bloomberg/mixce-acl2023
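To make the two directions concrete, the following is a minimal sketch for the synthetic setting, where P is known and both distributions are small discrete tables. It is not the released implementation: the function names, the mixing weight `eta`, the convex-combination form of the mixture, and the `eps` clipping are all assumptions made for illustration, since the abstract does not specify how the two terms are combined.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Return -sum_x p(x) * log q(x) in nats for discrete distributions.

    q is clipped at eps (an illustrative choice) so the value stays finite
    when q assigns zero probability to an outcome that p supports.
    """
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def mixce(p, q, eta=0.5):
    """Hypothetical mixture of forward and reverse cross-entropy.

    p: data distribution P (known only in synthetic settings)
    q: model distribution Q
    eta: assumed mixing weight in [0, 1]; eta = 1 recovers the MLE objective
    """
    forward = cross_entropy(p, q)  # expectation under P of -log Q
    reverse = cross_entropy(q, p)  # expectation under Q of -log P
    return eta * forward + (1.0 - eta) * reverse

# Toy example: P puts no mass on the third outcome, while the
# "over-generalizing" model Q spreads some probability onto it.
p = [0.5, 0.5, 0.0]
q_over = [0.4, 0.4, 0.2]

print("forward:", cross_entropy(p, q_over))
print("reverse:", cross_entropy(q_over, p))
print("mixed  :", mixce(p, q_over, eta=0.5))
```

In this toy case the forward term stays small even though Q places a fifth of its mass where P has none, while the reverse term is large because samples from Q are scored under P. That asymmetry is the intuition behind mixing the two. On real data P is unknown, so the reverse term cannot be computed directly like this.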