The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the training objective. From a distributional view, MLE minimizes the Kullback-Leibler divergence (KLD) between the distribution of the real data and that of the model. However, this approach forces the model to assign non-zero (sometimes large) probability mass to all training samples regardless of their quality. Moreover, in attempting to cover the low-probability regions of the data distribution, the model systematically overestimates the probability of corrupted text sequences, which we conjecture is one of the main reasons for text degeneration during autoregressive decoding. To remedy this problem, we turn to the total variation distance (TVD), which is robust to outliers, and develop practical bounds for applying it to language generation. We then introduce the TaiLr objective, which balances the tradeoff involved in estimating TVD. Intuitively, TaiLr downweights real data samples that have low model probabilities, with a tunable penalization intensity. Experimental results show that our method alleviates the overestimation of degenerate sequences without sacrificing diversity and improves generation quality on a wide range of text generation tasks.
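The contrast between KLD and TVD drawn above can be illustrated on a toy discrete distribution. The following is a minimal sketch under stated assumptions, not the method of the paper: the three-outcome data distribution, the helper names (`kl_divergence`, `total_variation`, `compare`, `illustrative_weight`), and in particular the weighting formula are hypothetical choices made for illustration, since the abstract does not state the exact form of the TaiLr objective.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_variation(p, q):
    """TVD(p, q) = 0.5 * sum_i |p_i - q_i|, which always lies in [0, 1]."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical data distribution: two clean outcomes and one rare corrupted outcome.
DATA = [0.60, 0.39, 0.01]


def compare(outlier_mass):
    """Return (KLD, TVD) for a model that gives `outlier_mass` to the corrupted outcome.

    The remaining mass is spread over the clean outcomes in their true proportions.
    """
    rest = 1.0 - outlier_mass
    clean = DATA[0] + DATA[1]
    model = [DATA[0] / clean * rest, DATA[1] / clean * rest, outlier_mass]
    return kl_divergence(DATA, model), total_variation(DATA, model)


def illustrative_weight(model_prob, gamma):
    """Assumed per-sample weight that shrinks for low model probability.

    gamma = 0 gives weight 1 (plain MLE); gamma = 1 gives weight equal to
    model_prob. This form is an assumption for illustration only.
    """
    return model_prob / (gamma + (1.0 - gamma) * model_prob)


if __name__ == "__main__":
    for mass in (1e-2, 1e-4, 1e-8):
        kld, tvd = compare(mass)
        print(f"outlier mass {mass:.0e}: KLD = {kld:.4f}, TVD = {tvd:.4f}")
    for prob in (0.01, 0.5, 0.9):
        print(f"model prob {prob}: weight = {illustrative_weight(prob, 0.1):.3f}")
```

In this toy setting, the KLD keeps growing as the model withdraws mass from the corrupted outcome, while the TVD never exceeds the mass the data places on it. This is one way to read the robustness to outliers mentioned above. The weight function shows, again only as an assumed form, how a single parameter could interpolate between uniform weighting and weighting by model probability.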