LLM text decoding is a key component of perceived LLM quality. We present two experiments showing that decoding methods can be improved by manipulating token probabilities. First, we test a few LLMs on the SummEval summary-scoring dataset to measure reading comprehension. We compare scores obtained by greedy decoding with expected values over the next-token distribution. We scale logits by a high temperature to increase the entropy of scores. This yields a strong performance improvement on SummEval (in terms of correlation with human judgement): from 6-8% to 13-28% for 7B Mistral and from 20-46% to 37-56% for Mixtral, beating the GPT-4 0314 result on two metrics. Part of the gain appears related to positional bias. Second, we use a probability-based tree-sampling algorithm to examine all of the most probable generations for a given prompt.
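The first experiment's core idea, scoring by an expected value over temperature-scaled score-token probabilities rather than by the greedy (argmax) token, can be sketched as follows. This is a minimal illustration with hypothetical logits, not the paper's actual prompt or model setup; the token set "1".."5" and the logit values are assumptions for the example.

```python
import numpy as np

def expected_score(logits: dict[str, float], temperature: float) -> float:
    """Expected value of a numeric score under the temperature-scaled
    next-token distribution, instead of the single greedy token."""
    tokens = list(logits)
    scaled = np.array([logits[t] for t in tokens]) / temperature
    probs = np.exp(scaled - scaled.max())  # stable softmax
    probs /= probs.sum()
    return float(sum(p * int(t) for p, t in zip(probs, tokens)))

# Hypothetical logits for the score tokens "1".."5" at the answer position.
logits = {"1": 2.0, "2": 1.5, "3": 1.0, "4": 3.0, "5": 0.5}

greedy = max(logits, key=logits.get)              # coarse integer score "4"
ev_low = expected_score(logits, temperature=1.0)   # close to the greedy score
ev_high = expected_score(logits, temperature=10.0) # higher-entropy, smoother score
```

A high temperature flattens the distribution, so the expected value blends information from all score tokens into a continuous score, which is what allows finer-grained correlation with human judgements than the discrete greedy output.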