Factual inconsistency poses a significant hurdle for the commercial deployment of abstractive summarizers. In the era of large language models (LLMs), this work focuses on two important questions: what is the best way to leverage LLMs for factual inconsistency detection, and how can we distill a smaller LLM that is both efficient and effective? We first propose three zero-shot paradigms and evaluate them across five diverse datasets: direct inference on the entire summary, direct inference on each summary window, and entity verification through question generation and answering. Experiments suggest that, under a proper paradigm design, an LLM can resolve this task without any training, surpassing strong trained baselines by 2.8% on average. To further promote practical utility, we then propose training strategies for distilling a smaller open-source LLM that learns to score the entire summary at once with high accuracy. The distilled model outperforms zero-shot approaches that use much larger LLMs, and serves as an effective and efficient ready-to-use scorer.
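To make the summary-window paradigm concrete, the following is a minimal sketch and not the authors' implementation. It assumes a window is a fixed number of consecutive summary sentences, that each window receives a consistency score in [0, 1], and that the summary-level score is the minimum over windows; none of these choices is specified in the abstract. The names `split_sentences`, `make_windows`, `score_summary`, and `toy_overlap_judge` are hypothetical, and the toy judge is only a stand-in for an actual zero-shot LLM call.

```python
import re
from typing import Callable, List

# A judge maps (document, summary_window) to a consistency score in [0, 1].
# In practice this would wrap a zero-shot LLM prompt; that part is assumed.
Judge = Callable[[str, str], float]


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def make_windows(sentences: List[str], size: int = 1) -> List[str]:
    """Group consecutive sentences into non-overlapping windows of `size`."""
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]


def score_summary(document: str, summary: str, judge: Judge, window_size: int = 1) -> float:
    """Score each window separately, then aggregate with min (an assumption):
    one inconsistent window makes the whole summary inconsistent."""
    windows = make_windows(split_sentences(summary), window_size)
    if not windows:
        return 1.0  # assumption: an empty summary makes no unsupported claim
    return min(judge(document, w) for w in windows)


def toy_overlap_judge(document: str, window: str) -> float:
    """Placeholder judge: fraction of window tokens that appear in the document.
    This is NOT an LLM and is only here so the sketch runs end to end."""
    doc_tokens = set(re.findall(r"\w+", document.lower()))
    win_tokens = re.findall(r"\w+", window.lower())
    if not win_tokens:
        return 1.0
    return sum(t in doc_tokens for t in win_tokens) / len(win_tokens)


if __name__ == "__main__":
    doc = "The cat sat on the mat."
    summ = "The cat sat. The dog barked."
    print(score_summary(doc, summ, toy_overlap_judge))
```

Setting `window_size` to the number of summary sentences reduces this sketch to the entire-summary paradigm, so the same interface covers both direct-inference variants; swapping `toy_overlap_judge` for a function that prompts an LLM is the only change needed to use a real model.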