Pre-trained language models have recently brought large gains in abstractive text summarization. The main concern with existing abstractive summarization methods is factual inconsistency in the summaries they generate. To alleviate this problem, many efforts have focused on developing factuality evaluation metrics based on natural language inference, question answering, and related techniques. However, these metrics suffer from high computational complexity and rely on annotated data. More recently, large language models such as ChatGPT have shown strong ability in both natural language understanding and natural language inference. In this paper, we study ChatGPT's ability to evaluate factual inconsistency in the zero-shot setting, testing it on coarse-grained and fine-grained factuality evaluation tasks: binary natural language inference (NLI), summary ranking, and consistency rating. Experimental results show that ChatGPT outperforms previous state-of-the-art (SOTA) evaluation metrics on 6 of 9 datasets across the three tasks, demonstrating its great potential for assessing factual inconsistency in the zero-shot setting. The results also highlight the importance of prompt design and the need for future work to address ChatGPT's limitations, namely evaluation bias, wrong reasoning, and hallucination.
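As an illustration of the zero-shot binary NLI setting described above, the following is a minimal sketch of how a document and summary might be turned into a prompt and how a free-text reply might be mapped to a label. The prompt wording, the function names, and the `ask` callable standing in for a chat model are all assumptions made for this example; the abstract does not give the actual prompts or interface used.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt wording; the abstract does not specify the real prompts.
NLI_TEMPLATE = (
    "Decide if the following summary is consistent with the article.\n"
    "Article: {document}\n"
    "Summary: {summary}\n"
    "Answer (yes or no):"
)


def build_nli_prompt(document: str, summary: str) -> str:
    """Fill the zero-shot template with one document/summary pair."""
    return NLI_TEMPLATE.format(document=document.strip(), summary=summary.strip())


def parse_binary_label(response: str) -> Optional[bool]:
    """Map a free-text reply to True (consistent), False (inconsistent),
    or None when neither 'yes' nor 'no' appears as a whole word."""
    match = re.search(r"\b(yes|no)\b", response.lower())
    if match is None:
        return None
    return match.group(1) == "yes"


def is_consistent(
    document: str, summary: str, ask: Callable[[str], str]
) -> Optional[bool]:
    """Run one zero-shot consistency check.

    `ask` is a hypothetical stand-in for any function that sends a prompt
    to a chat model and returns its text reply.
    """
    return parse_binary_label(ask(build_nli_prompt(document, summary)))


if __name__ == "__main__":
    # Stub model that always answers "No", used only to show the call shape.
    def stub(prompt: str) -> str:
        return "No, the summary contradicts the article."

    print(is_consistent("The cat sat on the mat.", "The dog sat on the mat.", stub))
```

Keeping prompt construction and reply parsing in separate functions makes it easy to vary the prompt wording, which the abstract identifies as an important factor, without touching the parsing logic.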