Beyond MLE: Convex Learning for Text Generation

Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution that best explain the observed data. In the context of text generation, MLE is often used to train generative language models, which can then be used to generate new text. However, we argue that MLE is not always necessary or optimal, especially for closed-ended text generation tasks like machine translation. In these tasks, the goal of the model is to generate the most appropriate response, which does not necessarily require it to estimate the entire data distribution with MLE. To this end, we propose a novel class of training objectives based on convex functions, which enables text generation models to focus on highly probable outputs without having to estimate the entire data distribution. We investigate the theoretical properties of the optimal predicted distribution when applying convex functions to the loss, demonstrating that convex functions can sharpen the optimal distribution, thereby enabling the model to better capture outputs with high probabilities. Experiments on various text generation tasks and models show the effectiveness of our approach. It enables autoregressive models to bridge the gap between greedy and beam search, and facilitates the learning of non-autoregressive models with a maximum improvement of 9+ BLEU points. Moreover, our approach also has a significant impact on large language models (LLMs), substantially enhancing their generative capability on various tasks. Source code is available at https://github.com/ictnlp/Convex-Learning.
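The sharpening effect described in the abstract can be illustrated with a small numerical sketch. It assumes a toy data distribution over three outputs and maximizes the expected objective sum_x p_data(x) * f(q(x)) over candidate distributions q on a grid. Taking f = log gives the MLE objective; f(q) = q^2 is used here only as one illustrative convex function and is not necessarily the objective used in the paper. All names (`p_data`, `best_on_grid`, and so on) are hypothetical.

```python
import math
from itertools import product


def expected_objective(p_data, q, f):
    """Return sum_x p_data(x) * f(q(x)), the quantity to be maximized."""
    return sum(p * f(qx) for p, qx in zip(p_data, q))


def simplex_grid(n_outcomes, steps):
    """Yield every distribution whose entries are multiples of 1/steps."""
    for counts in product(range(steps + 1), repeat=n_outcomes - 1):
        rest = steps - sum(counts)
        if rest >= 0:
            yield tuple(c / steps for c in counts) + (rest / steps,)


def best_on_grid(p_data, f, steps=20):
    """Return the grid distribution q with the highest expected objective."""
    return max(simplex_grid(len(p_data), steps),
               key=lambda q: expected_objective(p_data, q, f))


def safe_log(q):
    """Logarithm that returns -inf at zero instead of raising an error."""
    return math.log(q) if q > 0 else float("-inf")


# Toy data distribution (an assumption made for this illustration).
p_data = (0.5, 0.3, 0.2)

q_log = best_on_grid(p_data, safe_log)            # concave f: the MLE objective
q_convex = best_on_grid(p_data, lambda q: q * q)  # illustrative convex f

print("optimum with f = log:", q_log)
print("optimum with f = q^2:", q_convex)
```

Under these assumptions the log objective recovers the data distribution itself, while the convex objective places all probability mass on the most likely output, which is the behaviour closed-ended tasks such as machine translation need at decoding time.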



Maximum likelihood estimation


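Maximum likelihood estimation (MLE) is a method for estimating parameters. The idea was first proposed in 1821 by the German mathematician C. F. Gauss, but the method is usually credited to the British statistician R. A. Fisher. It is a statistical method built on the maximum likelihood principle, whose intuition is as follows: suppose a random experiment has several possible outcomes A, B, C, ...; if outcome A occurs in a single trial, we may conclude that the experimental conditions favour A, that is, that the probability P(A) is relatively large. For example, suppose box 1 contains 99 white balls and 1 black ball, and box 2 contains 1 white ball and 99 black balls. A box is chosen at random and a ball is drawn at random from it; the ball is black. A black ball is far more likely to be drawn from box 2 than from box 1, so we naturally believe that it came from box 2. In general, the probability of an event A depends on an unknown parameter theta, and different values of theta give different probabilities P(A | theta). If A occurs in a trial, we take theta to be the value, among all its possible values, that maximizes P(A | theta). Maximum likelihood estimation selects this value as the estimate of theta, so that the observed sample is as probable as possible under the selected population.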
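The two-box example can be worked through in a few lines. This is a minimal sketch that treats the choice of box as the unknown parameter theta; the names `boxes`, `likelihood`, `mle_box` and `posterior` are hypothetical, and the posterior calculation assumes each box is chosen with equal probability, as in the example.

```python
from fractions import Fraction

# Ball counts for the two boxes in the example.
boxes = {
    "box_1": {"white": 99, "black": 1},
    "box_2": {"white": 1, "black": 99},
}


def likelihood(box, colour):
    """P(colour | box): the chance of drawing this colour from this box."""
    counts = boxes[box]
    return Fraction(counts[colour], sum(counts.values()))


def mle_box(colour):
    """Return the box (the value of theta) that maximizes P(colour | box)."""
    return max(boxes, key=lambda box: likelihood(box, colour))


def posterior(box, colour):
    """P(box | colour) by Bayes' rule, assuming both boxes are equally likely."""
    total = sum(likelihood(b, colour) for b in boxes)
    return likelihood(box, colour) / total


print(likelihood("box_1", "black"))  # 1/100
print(likelihood("box_2", "black"))  # 99/100
print(mle_box("black"))              # box_2
print(posterior("box_2", "black"))   # 99/100
```

The maximum likelihood choice only compares P(black | box) across the two boxes. The posterior is included to show that, with equal prior probabilities, the same box is also the far more probable explanation of the observed black ball.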
