Despite tremendous improvements in natural language generation, summarization models still suffer from unfaithfulness. Previous work evaluates faithfulness either with models trained on other tasks or on in-domain synthetic data, or by prompting a large model such as ChatGPT. This paper proposes zero-shot faithfulness evaluation using only a moderately sized foundation language model. We introduce a new metric, FFLM, which combines probability changes based on the intuition that prefixing a piece of text that is consistent with the output will increase the probability of predicting the output. Experiments show that FFLM performs competitively with or even outperforms ChatGPT on both inconsistency detection and faithfulness rating with 24x fewer parameters. FFLM also achieves improvements over other strong baselines.
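The intuition behind the metric can be illustrated with a minimal sketch. This is not the FFLM formula itself, which the abstract does not spell out; it only shows the underlying probability-change idea: score an output with and without a prefix, then compare. The names `toy_cache_lm` and `prefix_gain` are hypothetical, and the toy scorer is a stand-in assumption for a real foundation language model that would return per-token log-probabilities.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical scorer signature: (context tokens, target tokens) -> one
# log-probability per target token. A real setup would query a language model.
Scorer = Callable[[Sequence[str], Sequence[str]], List[float]]


def toy_cache_lm(context: Sequence[str], target: Sequence[str],
                 vocab_size: int = 1000) -> List[float]:
    """Toy stand-in for a language model (an assumption, not a real LM).

    Uses an add-one smoothed unigram distribution estimated from the context,
    so tokens that appear in the context become more probable.
    """
    counts = {}
    for tok in context:
        counts[tok] = counts.get(tok, 0) + 1
    denom = len(context) + vocab_size
    return [math.log((counts.get(tok, 0) + 1) / denom) for tok in target]


def prefix_gain(score: Scorer, prefix: Sequence[str],
                output: Sequence[str]) -> float:
    """Mean per-token change in log-probability of `output` when `prefix`
    is prepended, relative to scoring `output` with no prefix at all."""
    if not output:
        raise ValueError("output must contain at least one token")
    with_prefix = score(prefix, output)
    without_prefix = score([], output)
    deltas = [w - wo for w, wo in zip(with_prefix, without_prefix)]
    return sum(deltas) / len(deltas)


# Illustrative data only: a source document and two candidate summaries.
document = "the cat sat on the mat".split()
faithful = "the cat sat".split()
unfaithful = "the dog ran".split()

gain_faithful = prefix_gain(toy_cache_lm, document, faithful)
gain_unfaithful = prefix_gain(toy_cache_lm, document, unfaithful)
print(gain_faithful > gain_unfaithful)
```

Under this toy scorer, the summary that is consistent with the document receives the larger gain, which is the behavior the intuition predicts. A real evaluation would swap in a language model for `toy_cache_lm` and would combine several such probability changes, as the abstract describes.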