Evaluating the Factual Consistency of Large Language Models Through News Summarization

While large language models (LLMs) have proven to be effective on a large variety of tasks, they are also known to hallucinate information. To measure whether an LLM prefers factually consistent continuations of its input, we propose a new benchmark called FIB(Factual Inconsistency Benchmark) that focuses on the task of summarization. Specifically, our benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factually inconsistent summary for an input news article. For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. To generate summaries that are factually inconsistent, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent. A model's factual consistency is then measured according to its accuracy, i.e.\ the proportion of documents where it assigns a higher score to the factually consistent summary. To validate the usefulness of FIB, we evaluate 23 large language models ranging from 1B to 176B parameters from six different model families including BLOOM and OPT. We find that existing LLMs generally assign a higher score to factually consistent summaries than to factually inconsistent summaries. However, if the factually inconsistent summaries occur verbatim in the document, then LLMs assign a higher score to these factually inconsistent summaries than factually consistent summaries. We validate design choices in our benchmark including the scoring method and source of distractor summaries. Our code and benchmark data can be found at https://github.com/r-three/fib.

翻译：尽管大语言模型（LLMs）在众多任务上表现出色，但其生成虚假信息（即"幻觉"现象）的问题也备受关注。为衡量LLM是否倾向于生成与输入内容事实一致的续写，我们提出名为FIB（事实不一致基准测试）的新基准，聚焦摘要任务。具体而言，该基准通过比较LLM对同一输入新闻文章的事实一致性摘要与事实不一致摘要的评分来实现评估。事实一致性摘要采用经人工验证的事实一致的人写参考摘要；事实不一致摘要则选取来自多个摘要模型生成且经人工标注为事实不一致的摘要。模型的事实一致性通过其准确性衡量，即模型对事实一致性摘要给出更高评分的文档比例。为验证FIB的有效性，我们评估了来自BLOOM、OPT等六个模型家族的23个大语言模型（参数量从1B到176B不等）。研究发现：现有LLM通常对事实一致性摘要的评分高于事实不一致摘要；但当事实不一致摘要与原文逐字重复时，LLM反而会对这些事实不一致摘要给出更高评分。我们还验证了基准中的评分方法及干扰项摘要来源等设计选择。相关代码与基准数据详见https://github.com/r-three/fib。

相关内容

大语言模型

关注 67

大语言模型是基于海量文本数据训练的深度学习模型。它不仅能够生成自然语言文本，还能够深入理解文本含义，处理各种自然语言任务，如文本摘要、问答、翻译等。2023年，大语言模型及其在人工智能领域的应用已成为全球科技研究的热点，其在规模上的增长尤为引人注目，参数量已从最初的十几亿跃升到如今的一万亿。参数量的提升使得模型能够更加精细地捕捉人类语言微妙之处，更加深入地理解人类语言的复杂性。在过去的一年里，大语言模型在吸纳新知识、分解复杂任务以及图文对齐等多方面都有显著提升。随着技术的不断成熟，它将不断拓展其应用范围，为人类提供更加智能化和个性化的服务，进一步改善人们的生活和生产方式。

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日