Language generation based on maximum likelihood estimation (MLE) has become the fundamental approach for text generation. Maximum likelihood estimation is typically performed by minimizing the log-likelihood loss, also known as the logarithmic score in statistical decision theory. The logarithmic score is strictly proper in the sense that it encourages honest forecasts: the expected score is maximized only when the model reports the true probabilities. Although many strictly proper scoring rules exist, the logarithmic score is the only local one among them, depending exclusively on the probability of the observed sample, which makes it capable of handling the exponentially large sample space of natural text. In this work, we propose a straightforward strategy for adapting scoring rules to language generation, allowing for language modeling with any non-local scoring rule. Leveraging this strategy, we train language generation models using two classic strictly proper scoring rules, the Brier score and the Spherical score, as alternatives to the logarithmic score. Experimental results indicate that simply substituting the loss function, without adjusting other hyperparameters, can yield substantial improvements in the model's generation capabilities. Moreover, these improvements scale up to large language models (LLMs) such as LLaMA-7B and LLaMA-13B. Source code: \url{https://github.com/shaochenze/ScoringRulesLM}.
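To make the distinction concrete, the following is a minimal sketch of the three scoring rules named above, using their standard textbook definitions for a categorical forecast p evaluated at an observed class y (higher is better for all three). This is illustrative only and not the paper's training implementation; note how the logarithmic score reads only p[y] (local), while the Brier and Spherical scores depend on the entire distribution (non-local).

```python
import math

def log_score(p, y):
    # Local rule: depends only on the probability assigned
    # to the observed outcome y.
    return math.log(p[y])

def brier_score(p, y):
    # Non-local rule: S(p, y) = 2*p[y] - sum_i p[i]^2,
    # which requires the full forecast distribution.
    return 2 * p[y] - sum(pi * pi for pi in p)

def spherical_score(p, y):
    # Non-local rule: S(p, y) = p[y] / ||p||_2.
    norm = math.sqrt(sum(pi * pi for pi in p))
    return p[y] / norm

# Example forecast over a 3-class "vocabulary", observed class 0.
p = [0.7, 0.2, 0.1]
print(brier_score(p, 0))      # 2*0.7 - (0.49 + 0.04 + 0.01) = 0.86
print(spherical_score(p, 0))  # 0.7 / sqrt(0.54)
```

All three rules are strictly proper, so in expectation each is maximized exactly when p equals the true distribution; they differ in how sharply they penalize miscalibrated forecasts and in whether they touch probabilities of unobserved outcomes.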