Despite their growing capabilities, language models still frequently reproduce content from their training data, generate repetitive text, and favor common grammatical patterns and vocabulary. A possible cause is the decoding strategy: the most common strategies either consider only the most probable tokens, which reduces output diversity, or boost the likelihood of improbable tokens, compromising output accuracy and correctness. In this paper, we propose DiffSampling, a new decoding method that leverages a mathematical analysis of the token probability distribution to ensure the generation of contextually appropriate text. In particular, the difference between consecutive, sorted probabilities can be used to truncate incorrect tokens. We also propose two variations of this method that aim to correct the subtle inconsistencies of common sampling strategies. Experiments involving four different text-generation tasks demonstrate that our approach consistently performs at least on par with the existing methods it builds upon in terms of quality, despite sampling from a larger set of tokens.